Perfect Hash Table Lookup with Unknown Keys?

  • Thread starter Thread starter rossj81
  • Start date Start date
  • Tags Tags
    Table
Click For Summary

Discussion Overview

The discussion revolves around the feasibility of creating a perfect hash function for a set of keys that are 40 alpha characters long, specifically formatted as \b[A-Z_]{40}\b, with the goal of mapping them to a maximum of 1 trillion (12-digit) integer indexes. Participants explore the implications of this mapping, including the requirements for determinism and non-surjectiveness, while considering the computational and storage challenges involved.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • Some participants argue that mapping 26^40 keys to 10^12 indexes is impossible due to the sheer number of potential keys exceeding the available indexes.
  • Others suggest that approximately 188 bits would be needed for the keys, and that including a space character would increase this requirement to 191 bits.
  • One participant proposes using a normal hash algorithm, suggesting a CRC or remainder function for initial indexing, followed by collision resolution techniques.
  • Another participant mentions the potential for using balanced binary search trees (BST) for O(log(n)) access, questioning the necessity of a perfect hash function.
  • Some participants note that perfect hashes are often impractical for large datasets, with one stating that they are more of a "holy grail" than a practical goal.
  • Ross expresses a willingness to accept O(n) complexity due to the anticipated small number of elements (around 400), despite initially seeking O(1) lookup times.
  • There is a discussion about the size and structure of the table, with some confusion regarding whether the index refers to the number of elements or the size of the keys.
  • One participant suggests using the Standard Template Library's map or dictionary structures, which could provide logarithmic access without the need for sorting.

Areas of Agreement / Disagreement

Participants generally agree that the proposed mapping is challenging due to the number of potential keys versus the available indexes. However, there is no consensus on the best approach to take, with multiple competing views on the use of hashing, binary search, and data structures like maps or BSTs.

Contextual Notes

Limitations include the assumptions about the number of keys and the size of the index, as well as the unresolved details regarding the implementation and efficiency of various proposed methods.

Who May Find This Useful

This discussion may be useful for software developers, computer scientists, and mathematicians interested in hashing techniques, data structures, and algorithmic efficiency in the context of large datasets.

rossj81
Messages
4
Reaction score
0
Hi all

Basically, I'm looking for a fingerprinting or perfect hash function.
The key is 40 alpha characters, specifically \b[A-Z_]{40}\b.
The index is a 12-digit integer.

  • The function must be deterministic, i.e. if a key maps to an index, it must always map to the same index.
  • The function must not be surjective, i.e. only one key must map to each index, but there may be some indexes which do not have corresponding keys.
  • The function does not need to be cryptographic.

Firstly, is it even possible to map this many keys to a maximum of 1x10^12 indexes?

If it is, I would appreciate any help on working out the algorithm. The end product has to be in Natural, but I should be able to translate from any C-style language.

Thanks so much!
Ross
 
Technology news on Phys.org
>only one key must map to each index

26^40 is greater than 10^12 so this part is not possible.
 
You will need something like 188 bits for that. Assuming you meant decimal digits integer that translates to a liitle bit more than 56 digits.
 
A trillion 40 byte keys? 40 tera-bytes of data just for the keys?
 
Borek said:
You will need something like 188 bits for that. Assuming you meant decimal digits integer that translates to a liitle bit more than 56 digits.

It looked to me like the space character was included, which would require 191 bits.
 
Thanks for the fast responses, guys.
I was afraid that the index size was too small, but I wanted someone to confirm my suspicions. O(n) it is, then - brute force search to the rescue!

Thanks again for your help.
 
There's no reason you couldn't use a normal hash algortihm. Do something like a CRC or other remainder type function to produce an initial index. Then do a lookup with this index and check for a match, a mis-match, or an empty slot. If it's a match you found your key. If it's a mis-match, add some fixed offset, usually a prime number modulo the remainder function and repeat the lookup. If it's an empty slot, the key isn't there.

You could sort the keys and use a binary search, but sorting a trillion keys is going to take quite a while.
 
CRGreathouse said:
It looked to me like the space character was included, which would require 191 bits.

Yes, missed that. Whenever I see space before square bracket on forums I think it is there to help in proper formatting, as forum soft has a tendency to treat such situations as some closing tag.
 
Can't you use balanced BST for a O(log(n)) access?
 
  • #10
Lookup the term 'avalanche' with regard to hashing.

There are very few 'perfect hashes' for a lot of reasons. High performance applications use hash indexes (example: Oracle) with expectations:
1. trade off making a calculation and a few lookups and compares for long, involved disk or memory lookups and compares
2. expect there to be some few collisions that resolve very quickly.

So, what you want is a short list of pointers in a hash table entry point (collisions) that can be resolved with a few compares.

Perfect hash discovery is not worth the programmer's time or resources required most of the time. Perfect hashes do exist when data sets are small and all items are known ahead of time, but it is more of a holy grail than a practical attainable goal.
 
  • #11
Thanks for all the suggestions.

While non-perfect hashes are a good way to reduce the required number of probes, in my case I don't believe it's worth the effort. The table isn't big enough to justify the added complexity. I just thought it would be cool to reduce the look-up to O(1).

Ross
 
  • #12
rossj81 said:
The table isn't big enough to justify the added complexity.
You stated that the index was 12 digits, which made me think that there were 1 trillion (possible) elements in the table. Perhaps you meant each table entry has a 40 character key, and a 12 digit value? Just how big is this table?
 
  • #13
Jeff

Actually, the upper limit would be based on the alpha key. I don't remember, so I looked on Wikipedia. If I understand correctly, 27 different possible characters in 40 places would be
Code:
P(n,r) = n! / (n - r)!
       = 40! / 13!
I'm not even going to try simplify, but I think it's even bigger than 10^12...

So the theoretical maximum number of elements is some _really_ big number, but I doubt that there will be more than 400 elements once implemented. Hence I am prepared to to accept O(n)...

Ross
 
  • #14
The ordering of the characters in the key is significant, so the number of possible keys is 2740. You mentioned that index is 12 digits, and if this was an index to the file, this would limint the number of possible file entries to 1012.

If there are only 400 elements, you could sort them and then binary search O(log2(n)). Link to sort thread, including example source code for vector and text file sorting. If you have standard template library, then stable_sort() would be the easiest to implement.

sorting revisted thread.htm
 
Last edited:
  • #15
Disclaimer: I don't know STL (nor English for that matter), so I can be using some weird terminology.

Why don't you use map or dictionary? If they are based on the balanced BST I have already mentioned, they can be built on the run, modified any time, and access is always logarithmic. I am more then sure they are implemented in STL so you don't need to do it by yourself. No need for sorting then.
 
  • #16
Borek said:
Standard Template Library - map class?
Looks like map class would be the best fit. The standard doesn't specify how a map is to be implemented, just that the elements are sorted by key. Wiki states that set class is implemented as a balanced binary search tree, but doesn't mention the implementation for map, but it might be the same.

http://en.wikipedia.org/wiki/Standard_Template_Library
 
Last edited:

Similar threads

Replies
1
Views
2K
  • · Replies 4 ·
Replies
4
Views
3K
  • · Replies 3 ·
Replies
3
Views
4K
  • · Replies 5 ·
Replies
5
Views
3K
Replies
29
Views
6K
  • · Replies 7 ·
Replies
7
Views
3K
Replies
5
Views
3K
  • · Replies 67 ·
3
Replies
67
Views
17K
  • · Replies 7 ·
Replies
7
Views
4K
  • · Replies 7 ·
Replies
7
Views
4K