Optimizing Hash Table Storage for Efficient Record Retrieval

  • Context: Comp Sci 
  • Thread starter: JonnyG
  • Tags: Table
SUMMARY

This discussion focuses on optimizing hash table storage for efficient record retrieval, specifically addressing the challenge of storing a hash table on disk while maintaining performance. The hash table is required to hold approximately 100 records of 128 bytes each, about 12.5 KB of data in total. The consensus is that while it may be tempting to keep references in main memory to avoid disk accesses, the assignment requires the hash table to be stored as a single file on disk. The fseek() function is recommended for managing file access efficiently.

PREREQUISITES
  • Understanding of hashing algorithms and their implementation
  • Familiarity with file handling in C++, particularly the fseek() function
  • Knowledge of memory management and storage optimization techniques
  • Basic concepts of data structures, specifically hash tables
NEXT STEPS
  • Research efficient file I/O operations in C++ using fseek() and fread()
  • Explore advanced hashing techniques and collision resolution methods
  • Learn about memory-mapped files for improved performance in data retrieval
  • Investigate the impact of data structure design on performance in large datasets
USEFUL FOR

Software developers, particularly those working with data storage solutions, systems programmers, and anyone involved in optimizing data retrieval processes in applications.

JonnyG
Homework Statement
Implement a database stored on disk using bucket hashing. Define records to
be 128 bytes long with a 4-byte key and 120 bytes of data. The remaining
4 bytes are available for you to store necessary information to support the
hash table. A bucket in the hash table will be 1024 bytes long, so each bucket
has space for 8 records. The hash table should consist of 27 buckets (total
space for 216 records with slots indexed by positions 0 to 215) followed by
the overflow bucket at record position 216 in the file. The hash function for
key value K should be K mod 213. (Note that this means the last three
slots in the table will not be home positions for any record.) The collision
resolution function should be linear probing with wrap-around within the
bucket. For example, if a record is hashed to slot 5, the collision resolution
process will attempt to insert the record into the table in the order 5, 6, 7, 0,
1, 2, 3, and finally 4. If a bucket is full, the record should be placed in the
overflow section at the end of the file.

Your hash table should implement the dictionary ADT of Section 4.4. When
you do your testing, assume that the system is meant to store about 100 or so
records at a time.
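The layout described in the statement above maps directly onto a handful of constants plus two small functions: the home-slot hash (K mod 213) and the in-bucket probe sequence. A minimal sketch in C++ (the names are illustrative, not part of the assignment):

```cpp
#include <cassert>
#include <cstdint>

// Layout constants from the assignment: 128-byte records, 1024-byte
// buckets (8 records each), 27 regular buckets plus one overflow bucket.
constexpr int kRecordSize     = 128;
constexpr int kSlotsPerBucket = 8;
constexpr int kNumBuckets     = 27;
constexpr int kTableSlots     = kNumBuckets * kSlotsPerBucket; // 216
constexpr int kOverflowStart  = kTableSlots;                   // slot 216

// Home slot for key K: K mod 213, so slots 213-215 are never home slots.
int homeSlot(uint32_t key) { return key % 213; }

// i-th probe position for a record whose home slot is `home`:
// linear probing with wrap-around *within the bucket*.
int probeSlot(int home, int i) {
    int bucket = home / kSlotsPerBucket;
    int offset = (home % kSlotsPerBucket + i) % kSlotsPerBucket;
    return bucket * kSlotsPerBucket + offset;
}

// Byte offset of a slot in the single database file.
long fileOffset(int slot) { return static_cast<long>(slot) * kRecordSize; }
```

With home slot 5, the probe sequence probeSlot(5, 0), probeSlot(5, 1), ... yields 5, 6, 7, 0, 1, 2, 3, 4, matching the example in the statement; if all eight slots are occupied, the record goes to the overflow section starting at slot 216.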
Relevant Equations
No relevant equations. I want to implement this hash table in C++.
I understand hashing (at least in theory), but I am struggling with the actual implementation. I was wondering about the best way to approach this problem, and that just led to more questions. Now I am confused.

1) The database has to be stored on disk. Let's say it's stored as a folder of individual files, with each file containing a record. But where is the actual hash table stored? The question states that each record is to be 128 bytes long and that the hash table should hold about 100 records. So the hash table will be about 12.5 KB in size. I can easily fit this in main memory, but the question asks me to store the hash table on disk. But then every time I want to search for a record, I would have to make a disk access, which would be relatively slow.

Wouldn't it be better if the hash table was stored in main memory as an array of references to the files instead? (Maybe store a string that contains the path and filename.) A string would be 32 bytes (I used the sizeof operator in C++ to check this), so an array of 100 strings would easily fit into main memory. This would mean there are no disk accesses when searching the hash table, so the program should be faster and would take up less space than if I stored the entire records in the array (where each record is 128 bytes).

Am I thinking about this the wrong way?
 
The hash table is stored on disk, and saved there when it is modified, but it can be loaded into memory for interactive use.
 
JonnyG said:
Let's say it's stored as a folder of individual files, with each file containing a record.
That's not reasonable -- it should be stored as a single file that contains the 27 buckets with their 216 records.
JonnyG said:
But where is the actual hash table stored? The question states that each record is to be 128 bytes long and that the hash table should hold about 100 records. So the hash table will be about 12.5 KB in size.
It's stored on disk -- that's one of the requirements. The point of the exercise is to get a sense of how to work with data that could conceivably be too large to fit into main memory.
JonnyG said:
Wouldn't it be better if the hash table was stored in main memory as an array of references to the files instead?
No. For one thing, your program would need to keep track of a multitude of files. For another, your program really needs to follow the requirements that are given.
 
  • Likes: FactChecker
The hash table and data should be stored as a single file that is read in, modified in memory, and saved when appropriate.
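The read-in/modify/save cycle described above can be sketched in a few lines. Assumptions here: the whole database is one file of 28 fixed-size buckets (27 table buckets plus the overflow bucket, 28 × 1024 = 28,672 bytes), and the file name and function names are my own, not from the assignment:

```cpp
#include <cstdio>
#include <vector>

// 27 table buckets + 1 overflow bucket, 1024 bytes each.
constexpr size_t kFileSize = 28 * 1024;

// Load the whole table file into a memory buffer.
// A missing file yields an all-zero (empty) table.
std::vector<unsigned char> loadTable(const char* path) {
    std::vector<unsigned char> buf(kFileSize, 0);
    if (std::FILE* f = std::fopen(path, "rb")) {
        std::fread(buf.data(), 1, buf.size(), f);
        std::fclose(f);
    }
    return buf;
}

// Write the buffer back out in one shot.
bool saveTable(const char* path, const std::vector<unsigned char>& buf) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    size_t n = std::fwrite(buf.data(), 1, buf.size(), f);
    std::fclose(f);
    return n == buf.size();
}
```

All inserts, searches, and deletes then operate on the in-memory buffer, and the file is rewritten when appropriate (e.g. on shutdown or after each modification).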
 
  • Likes: JonnyG
Hint: fseek() command.
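To make the hint concrete: fseek() lets you jump straight to byte offset slot × 128 in the single database file, so a lookup touches only the record (or the 1024-byte bucket) you need rather than the whole file. A minimal sketch using C stdio (the function names are my own, not from the assignment):

```cpp
#include <cstdio>

constexpr long kRecSize = 128;  // record size from the assignment

// Read the 128-byte record stored at `slot` (slot 216 is the start of
// the overflow section): seek to slot * 128, then read 128 bytes.
bool readRecord(std::FILE* f, int slot, unsigned char out[128]) {
    if (std::fseek(f, slot * kRecSize, SEEK_SET) != 0) return false;
    return std::fread(out, 1, kRecSize, f) == static_cast<size_t>(kRecSize);
}

// Overwrite the record at `slot` in place, leaving the rest of the
// file untouched.
bool writeRecord(std::FILE* f, int slot, const unsigned char rec[128]) {
    if (std::fseek(f, slot * kRecSize, SEEK_SET) != 0) return false;
    return std::fwrite(rec, 1, kRecSize, f) == static_cast<size_t>(kRecSize);
}
```

Note that C stdio requires a repositioning call (such as the fseek() here) between a write and a subsequent read on the same "r+b"/"w+b" stream, so seeking before every record access is not just convenient but required.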
 
