# Perfect Hash Table Lookup with Unknown Keys?

## Main Question or Discussion Point

Hi all

Basically, I'm looking for a fingerprinting or perfect hash function.
The key is 40 alpha characters, specifically \b[A-Z_]{40}\b.
The index is a 12-digit integer.

• The function must be deterministic, i.e. if a key maps to an index, it must always map to the same index.
• The function must not be surjective, i.e. only one key must map to each index, but there may be some indexes which do not have corresponding keys.
• The function does not need to be cryptographic.

Firstly, is it even possible to map this many keys to a maximum of 1x10^12 indexes?

If it is, I would appreciate any help on working out the algorithm. The end product has to be in Natural, but I should be able to translate from any C-style language.

Thanks so much!
Ross

Related Programming and Computer Science News on Phys.org
rcgldr
Homework Helper
>only one key must map to each index

26^40 is greater than 10^12 so this part is not possible.

Borek
Mentor
You will need something like 188 bits for that. Assuming you meant decimal digits integer that translates to a liitle bit more than 56 digits.

rcgldr
Homework Helper
A trillion 40 byte keys? 40 tera-bytes of data just for the keys?

CRGreathouse
Homework Helper
You will need something like 188 bits for that. Assuming you meant decimal digits integer that translates to a liitle bit more than 56 digits.
It looked to me like the space character was included, which would require 191 bits.

Thanks for the fast responses, guys.
I was afraid that the index size was too small, but I wanted someone to confirm my suspicions. O(n) it is, then - brute force search to the rescue!!

rcgldr
Homework Helper
There's no reason you couldn't use a normal hash algortihm. Do something like a CRC or other remainder type function to produce an initial index. Then do a lookup with this index and check for a match, a mis-match, or an empty slot. If it's a match you found your key. If it's a mis-match, add some fixed offset, usually a prime number modulo the remainder function and repeat the lookup. If it's an empty slot, the key isn't there.

You could sort the keys and use a binary search, but sorting a trillion keys is going to take quite a while.

Borek
Mentor
It looked to me like the space character was included, which would require 191 bits.
Yes, missed that. Whenever I see space before square bracket on forums I think it is there to help in proper formatting, as forum soft has a tendency to treat such situations as some closing tag.

Borek
Mentor
Can't you use balanced BST for a O(log(n)) access?

jim mcnamara
Mentor
Lookup the term 'avalanche' with regard to hashing.

There are very few 'perfect hashes' for a lot of reasons. High performance applications use hash indexes (example: Oracle) with expectations:
1. trade off making a calculation and a few lookups and compares for long, involved disk or memory lookups and compares
2. expect there to be some few collisions that resolve very quickly.

So, what you want is a short list of pointers in a hash table entry point (collisions) that can be resolved with a few compares.

Perfect hash discovery is not worth the programmer's time or resources required most of the time. Perfect hashes do exist when data sets are small and all items are known ahead of time, but it is more of a holy grail than a practical attainable goal.

Thanks for all the suggestions.

While non-perfect hashes are a good way to reduce the required number of probes, in my case I don't believe it's worth the effort. The table isn't big enough to justify the added complexity. I just thought it would be cool to reduce the look-up to O(1).

Ross

rcgldr
Homework Helper
The table isn't big enough to justify the added complexity.
You stated that the index was 12 digits, which made me think that there were 1 trillion (possible) elements in the table. Perhaps you meant each table entry has a 40 character key, and a 12 digit value? Just how big is this table?

Jeff

Actually, the upper limit would be based on the alpha key. I don't remember, so I looked on Wikipedia. If I understand correctly, 27 different possible characters in 40 places would be
Code:
P(n,r) = n! / (n - r)!
= 40! / 13!
I'm not even going to try simplify, but I think it's even bigger than 10^12...

So the theoretical maximum number of elements is some _really_ big number, but I doubt that there will be more than 400 elements once implemented. Hence I am prepared to to accept O(n)...

Ross

rcgldr
Homework Helper
The ordering of the characters in the key is significant, so the number of possible keys is 2740. You mentioned that index is 12 digits, and if this was an index to the file, this would limint the number of possible file entries to 1012.

If there are only 400 elements, you could sort them and then binary search O(log2(n)). Link to sort thread, including example source code for vector and text file sorting. If you have standard template library, then stable_sort() would be the easiest to implement.

Last edited:
Borek
Mentor
Disclaimer: I don't know STL (nor English for that matter), so I can be using some weird terminology.

Why don't you use map or dictionary? If they are based on the balanced BST I have already mentioned, they can be built on the run, modified any time, and access is always logarithmic. I am more then sure they are implemented in STL so you don't need to do it by yourself. No need for sorting then.

rcgldr
Homework Helper
Standard Template Library - map class?
Looks like map class would be the best fit. The standard doesn't specify how a map is to be implemented, just that the elements are sorted by key. Wiki states that set class is implemented as a balanced binary search tree, but doesn't mention the implementation for map, but it might be the same.

http://en.wikipedia.org/wiki/Standard_Template_Library

Last edited: