Perfect Hash Table Lookup with Unknown Keys?

  • Thread starter rossj81
  • Start date
  • Tags
    Table
In summary, Ross is looking for a perfect hash function that can reduce the number of lookups required for an indexed table. He is concerned that the index size is too small for the number of keys proposed. Jeff suggests using a normal hash algorithm with a pointer to a sort thread that includes an example source code.
  • #1
rossj81
4
0
Hi all

Basically, I'm looking for a fingerprinting or perfect hash function.
The key is 40 alpha characters, specifically \b[A-Z_]{40}\b.
The index is a 12-digit integer.

  • The function must be deterministic, i.e. if a key maps to an index, it must always map to the same index.
  • The function must not be surjective, i.e. only one key must map to each index, but there may be some indexes which do not have corresponding keys.
  • The function does not need to be cryptographic.

Firstly, is it even possible to map this many keys to a maximum of 1x10^12 indexes?

If it is, I would appreciate any help on working out the algorithm. The end product has to be in Natural, but I should be able to translate from any C-style language.

Thanks so much!
Ross
 
Technology news on Phys.org
  • #2
>only one key must map to each index

26^40 is greater than 10^12 so this part is not possible.
 
  • #3
You will need something like 188 bits for that. Assuming you meant decimal digits integer that translates to a liitle bit more than 56 digits.
 
  • #4
A trillion 40 byte keys? 40 tera-bytes of data just for the keys?
 
  • #5
Borek said:
You will need something like 188 bits for that. Assuming you meant decimal digits integer that translates to a liitle bit more than 56 digits.

It looked to me like the space character was included, which would require 191 bits.
 
  • #6
Thanks for the fast responses, guys.
I was afraid that the index size was too small, but I wanted someone to confirm my suspicions. O(n) it is, then - brute force search to the rescue!

Thanks again for your help.
 
  • #7
There's no reason you couldn't use a normal hash algortihm. Do something like a CRC or other remainder type function to produce an initial index. Then do a lookup with this index and check for a match, a mis-match, or an empty slot. If it's a match you found your key. If it's a mis-match, add some fixed offset, usually a prime number modulo the remainder function and repeat the lookup. If it's an empty slot, the key isn't there.

You could sort the keys and use a binary search, but sorting a trillion keys is going to take quite a while.
 
  • #8
CRGreathouse said:
It looked to me like the space character was included, which would require 191 bits.

Yes, missed that. Whenever I see space before square bracket on forums I think it is there to help in proper formatting, as forum soft has a tendency to treat such situations as some closing tag.
 
  • #9
Can't you use balanced BST for a O(log(n)) access?
 
  • #10
Lookup the term 'avalanche' with regard to hashing.

There are very few 'perfect hashes' for a lot of reasons. High performance applications use hash indexes (example: Oracle) with expectations:
1. trade off making a calculation and a few lookups and compares for long, involved disk or memory lookups and compares
2. expect there to be some few collisions that resolve very quickly.

So, what you want is a short list of pointers in a hash table entry point (collisions) that can be resolved with a few compares.

Perfect hash discovery is not worth the programmer's time or resources required most of the time. Perfect hashes do exist when data sets are small and all items are known ahead of time, but it is more of a holy grail than a practical attainable goal.
 
  • #11
Thanks for all the suggestions.

While non-perfect hashes are a good way to reduce the required number of probes, in my case I don't believe it's worth the effort. The table isn't big enough to justify the added complexity. I just thought it would be cool to reduce the look-up to O(1).

Ross
 
  • #12
rossj81 said:
The table isn't big enough to justify the added complexity.
You stated that the index was 12 digits, which made me think that there were 1 trillion (possible) elements in the table. Perhaps you meant each table entry has a 40 character key, and a 12 digit value? Just how big is this table?
 
  • #13
Jeff

Actually, the upper limit would be based on the alpha key. I don't remember, so I looked on Wikipedia. If I understand correctly, 27 different possible characters in 40 places would be
Code:
P(n,r) = n! / (n - r)!
       = 40! / 13!
I'm not even going to try simplify, but I think it's even bigger than 10^12...

So the theoretical maximum number of elements is some _really_ big number, but I doubt that there will be more than 400 elements once implemented. Hence I am prepared to to accept O(n)...

Ross
 
  • #14
The ordering of the characters in the key is significant, so the number of possible keys is 2740. You mentioned that index is 12 digits, and if this was an index to the file, this would limint the number of possible file entries to 1012.

If there are only 400 elements, you could sort them and then binary search O(log2(n)). Link to sort thread, including example source code for vector and text file sorting. If you have standard template library, then stable_sort() would be the easiest to implement.

sorting revisted thread.htm
 
Last edited:
  • #15
Disclaimer: I don't know STL (nor English for that matter), so I can be using some weird terminology.

Why don't you use map or dictionary? If they are based on the balanced BST I have already mentioned, they can be built on the run, modified any time, and access is always logarithmic. I am more then sure they are implemented in STL so you don't need to do it by yourself. No need for sorting then.
 
  • #16
Borek said:
Standard Template Library - map class?
Looks like map class would be the best fit. The standard doesn't specify how a map is to be implemented, just that the elements are sorted by key. Wiki states that set class is implemented as a balanced binary search tree, but doesn't mention the implementation for map, but it might be the same.

http://en.wikipedia.org/wiki/Standard_Template_Library
 
Last edited:

1. What is a perfect hash table?

A perfect hash table is a type of data structure that allows for efficient lookup of values based on a key. It is called "perfect" because it has no collisions, meaning each key maps to a unique index in the table.

2. How does a perfect hash table handle unknown keys?

In a perfect hash table, unknown keys can be handled by using a two-level hashing approach. This involves using a primary hash function to map keys to buckets, and then using a secondary hash function to map keys within each bucket to a specific index. This ensures that even unknown keys will have a unique index in the table.

3. What are the advantages of using a perfect hash table?

The main advantage of using a perfect hash table is its efficiency in lookup operations. Since each key maps to a unique index, there are no collisions and therefore no need for additional comparisons or calculations. This makes lookup operations much faster compared to other types of hash tables.

4. How is a perfect hash table created?

A perfect hash table is created by first identifying all the possible keys that will be used in the table. Then, a primary hash function is chosen to map these keys to buckets. Next, a secondary hash function is chosen to map the keys within each bucket to a specific index. This process may involve trial and error to find the most optimal hash functions.

5. What are the limitations of using a perfect hash table?

One limitation of using a perfect hash table is the time and effort required to create it. Finding the most optimal hash functions can be a complex and time-consuming process. Additionally, if the set of keys being used in the table changes, the hash functions may need to be recalculated. This can be a challenging task for large and constantly changing datasets.

Similar threads

  • Engineering and Comp Sci Homework Help
Replies
4
Views
2K
  • Programming and Computer Science
Replies
1
Views
2K
  • Engineering and Comp Sci Homework Help
Replies
5
Views
2K
  • Programming and Computer Science
Replies
29
Views
2K
  • Math Proof Training and Practice
2
Replies
67
Views
10K
Replies
7
Views
3K
  • Engineering and Comp Sci Homework Help
Replies
5
Views
2K
  • MATLAB, Maple, Mathematica, LaTeX
Replies
7
Views
2K
  • MATLAB, Maple, Mathematica, LaTeX
Replies
7
Views
3K
Back
Top