Perfect Hash Table Lookup with Unknown Keys?

rossj81 · May 10, 2008

Hi all

Basically, I'm looking for a fingerprinting or perfect hash function.
The key is 40 alpha characters, specifically \b[A-Z_]{40}\b.
The index is a 12-digit integer.

The function must be deterministic, i.e. if a key maps to an index, it must always map to the same index.
The function must not be surjective, i.e. only one key must map to each index, but there may be some indexes which do not have corresponding keys.
The function does not need to be cryptographic.

Firstly, is it even possible to map this many keys to a maximum of 1x10^12 indexes?

If it is, I would appreciate any help on working out the algorithm. The end product has to be in Natural, but I should be able to translate from any C-style language.

Thanks so much!
Ross

rcgldr · May 10, 2008

>only one key must map to each index

26^40 is greater than 10^12 so this part is not possible.

Borek · May 10, 2008

You will need something like 188 bits for that. Assuming you meant decimal digits integer that translates to a liitle bit more than 56 digits.

rcgldr · May 10, 2008

A trillion 40 byte keys? 40 tera-bytes of data just for the keys?

CRGreathouse · May 10, 2008

Borek said:

You will need something like 188 bits for that. Assuming you meant decimal digits integer that translates to a liitle bit more than 56 digits.

It looked to me like the space character was included, which would require 191 bits.

rossj81 · May 11, 2008

Thanks for the fast responses, guys.
I was afraid that the index size was too small, but I wanted someone to confirm my suspicions. O(n) it is, then - brute force search to the rescue!

Thanks again for your help.

rcgldr · May 11, 2008

There's no reason you couldn't use a normal hash algortihm. Do something like a CRC or other remainder type function to produce an initial index. Then do a lookup with this index and check for a match, a mis-match, or an empty slot. If it's a match you found your key. If it's a mis-match, add some fixed offset, usually a prime number modulo the remainder function and repeat the lookup. If it's an empty slot, the key isn't there.

You could sort the keys and use a binary search, but sorting a trillion keys is going to take quite a while.

Borek · May 11, 2008

CRGreathouse said:

It looked to me like the space character was included, which would require 191 bits.

Yes, missed that. Whenever I see space before square bracket on forums I think it is there to help in proper formatting, as forum soft has a tendency to treat such situations as some closing tag.

Borek · May 11, 2008

Can't you use balanced BST for a O(log(n)) access?

jim mcnamara · May 12, 2008

Lookup the term 'avalanche' with regard to hashing.

There are very few 'perfect hashes' for a lot of reasons. High performance applications use hash indexes (example: Oracle) with expectations:
1. trade off making a calculation and a few lookups and compares for long, involved disk or memory lookups and compares
2. expect there to be some few collisions that resolve very quickly.

So, what you want is a short list of pointers in a hash table entry point (collisions) that can be resolved with a few compares.

Perfect hash discovery is not worth the programmer's time or resources required most of the time. Perfect hashes do exist when data sets are small and all items are known ahead of time, but it is more of a holy grail than a practical attainable goal.

rossj81 · May 14, 2008

Thanks for all the suggestions.

While non-perfect hashes are a good way to reduce the required number of probes, in my case I don't believe it's worth the effort. The table isn't big enough to justify the added complexity. I just thought it would be cool to reduce the look-up to O(1).

Ross

rcgldr · May 14, 2008

rossj81 said:

The table isn't big enough to justify the added complexity.

You stated that the index was 12 digits, which made me think that there were 1 trillion (possible) elements in the table. Perhaps you meant each table entry has a 40 character key, and a 12 digit value? Just how big is this table?

rossj81 · May 15, 2008

Jeff

Actually, the upper limit would be based on the alpha key. I don't remember, so I looked on Wikipedia. If I understand correctly, 27 different possible characters in 40 places would be

Code:

P(n,r) = n! / (n - r)!
       = 40! / 13!

I'm not even going to try simplify, but I think it's even bigger than 10^12...

So the theoretical maximum number of elements is some _really_ big number, but I doubt that there will be more than 400 elements once implemented. Hence I am prepared to to accept O(n)...

Ross

rcgldr · May 15, 2008

The ordering of the characters in the key is significant, so the number of possible keys is 27⁴⁰. You mentioned that index is 12 digits, and if this was an index to the file, this would limint the number of possible file entries to 10¹².

If there are only 400 elements, you could sort them and then binary search O(log₂(n)). Link to sort thread, including example source code for vector and text file sorting. If you have standard template library, then stable_sort() would be the easiest to implement.

sorting revisted thread.htm

Borek · May 15, 2008

Disclaimer: I don't know STL (nor English for that matter), so I can be using some weird terminology.

Why don't you use map or dictionary? If they are based on the balanced BST I have already mentioned, they can be built on the run, modified any time, and access is always logarithmic. I am more then sure they are implemented in STL so you don't need to do it by yourself. No need for sorting then.

rcgldr · May 15, 2008

Borek said:

Standard Template Library - map class?

Looks like map class would be the best fit. The standard doesn't specify how a map is to be implemented, just that the elements are sorted by key. Wiki states that set class is implemented as a balanced binary search tree, but doesn't mention the implementation for map, but it might be the same.

http://en.wikipedia.org/wiki/Standard_Template_Library

Perfect Hash Table Lookup with Unknown Keys?

Discussion Overview

Discussion Character

Main Points Raised

Areas of Agreement / Disagreement

Contextual Notes

Who May Find This Useful

Similar threads

Use of AI (ML/DL) in Science

Other than just FizzBuzz to test programmer candidates

File Structure vs Data Structure

How to show RS(U+TRS)* is equivalent to (R+SUT)SU?

HTML/CSS Problems with DNS records

Insights Remote Operated Gate Control System

Insights AI Enriched Problem Solving

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect

Insights What Exactly is Dirac’s Delta Function? - Insight