Dismiss Notice
Join Physics Forums Today!
The friendliest, high quality science and math community on the planet! Everyone who loves science is here!

Perfect Hash Table Lookup with Unknown Keys?

  1. May 10, 2008 #1
    Hi all

    Basically, I'm looking for a fingerprinting or perfect hash function.
    The key is 40 alpha characters, specifically \b[A-Z_]{40}\b.
    The index is a 12-digit integer.

    • The function must be deterministic, i.e. if a key maps to an index, it must always map to the same index.
    • The function must not be surjective, i.e. only one key must map to each index, but there may be some indexes which do not have corresponding keys.
    • The function does not need to be cryptographic.

    Firstly, is it even possible to map this many keys to a maximum of 1x10^12 indexes?

    If it is, I would appreciate any help on working out the algorithm. The end product has to be in Natural, but I should be able to translate from any C-style language.

    Thanks so much!
    Ross
     
  2. jcsd
  3. May 10, 2008 #2

    rcgldr

    User Avatar
    Homework Helper

    >only one key must map to each index

    26^40 is greater than 10^12 so this part is not possible.
     
  4. May 10, 2008 #3

    Borek

    User Avatar

    Staff: Mentor

    You will need something like 188 bits for that. Assuming you meant decimal digits integer that translates to a liitle bit more than 56 digits.
     
  5. May 10, 2008 #4

    rcgldr

    User Avatar
    Homework Helper

    A trillion 40 byte keys? 40 tera-bytes of data just for the keys?
     
  6. May 10, 2008 #5

    CRGreathouse

    User Avatar
    Science Advisor
    Homework Helper

    It looked to me like the space character was included, which would require 191 bits.
     
  7. May 11, 2008 #6
    Thanks for the fast responses, guys.
    I was afraid that the index size was too small, but I wanted someone to confirm my suspicions. O(n) it is, then - brute force search to the rescue!!

    Thanks again for your help.
     
  8. May 11, 2008 #7

    rcgldr

    User Avatar
    Homework Helper

    There's no reason you couldn't use a normal hash algortihm. Do something like a CRC or other remainder type function to produce an initial index. Then do a lookup with this index and check for a match, a mis-match, or an empty slot. If it's a match you found your key. If it's a mis-match, add some fixed offset, usually a prime number modulo the remainder function and repeat the lookup. If it's an empty slot, the key isn't there.

    You could sort the keys and use a binary search, but sorting a trillion keys is going to take quite a while.
     
  9. May 11, 2008 #8

    Borek

    User Avatar

    Staff: Mentor

    Yes, missed that. Whenever I see space before square bracket on forums I think it is there to help in proper formatting, as forum soft has a tendency to treat such situations as some closing tag.
     
  10. May 11, 2008 #9

    Borek

    User Avatar

    Staff: Mentor

    Can't you use balanced BST for a O(log(n)) access?
     
  11. May 12, 2008 #10

    jim mcnamara

    User Avatar
    Science Advisor
    Gold Member

    Lookup the term 'avalanche' with regard to hashing.

    There are very few 'perfect hashes' for a lot of reasons. High performance applications use hash indexes (example: Oracle) with expectations:
    1. trade off making a calculation and a few lookups and compares for long, involved disk or memory lookups and compares
    2. expect there to be some few collisions that resolve very quickly.

    So, what you want is a short list of pointers in a hash table entry point (collisions) that can be resolved with a few compares.

    Perfect hash discovery is not worth the programmer's time or resources required most of the time. Perfect hashes do exist when data sets are small and all items are known ahead of time, but it is more of a holy grail than a practical attainable goal.
     
  12. May 14, 2008 #11
    Thanks for all the suggestions.

    While non-perfect hashes are a good way to reduce the required number of probes, in my case I don't believe it's worth the effort. The table isn't big enough to justify the added complexity. I just thought it would be cool to reduce the look-up to O(1).

    Ross
     
  13. May 14, 2008 #12

    rcgldr

    User Avatar
    Homework Helper

    You stated that the index was 12 digits, which made me think that there were 1 trillion (possible) elements in the table. Perhaps you meant each table entry has a 40 character key, and a 12 digit value? Just how big is this table?
     
  14. May 15, 2008 #13
    Jeff

    Actually, the upper limit would be based on the alpha key. I don't remember, so I looked on Wikipedia. If I understand correctly, 27 different possible characters in 40 places would be
    Code (Text):
    P(n,r) = n! / (n - r)!
           = 40! / 13!
    I'm not even going to try simplify, but I think it's even bigger than 10^12...

    So the theoretical maximum number of elements is some _really_ big number, but I doubt that there will be more than 400 elements once implemented. Hence I am prepared to to accept O(n)...

    Ross
     
  15. May 15, 2008 #14

    rcgldr

    User Avatar
    Homework Helper

    The ordering of the characters in the key is significant, so the number of possible keys is 2740. You mentioned that index is 12 digits, and if this was an index to the file, this would limint the number of possible file entries to 1012.

    If there are only 400 elements, you could sort them and then binary search O(log2(n)). Link to sort thread, including example source code for vector and text file sorting. If you have standard template library, then stable_sort() would be the easiest to implement.

    sorting revisted thread.htm
     
    Last edited: May 15, 2008
  16. May 15, 2008 #15

    Borek

    User Avatar

    Staff: Mentor

    Disclaimer: I don't know STL (nor English for that matter), so I can be using some weird terminology.

    Why don't you use map or dictionary? If they are based on the balanced BST I have already mentioned, they can be built on the run, modified any time, and access is always logarithmic. I am more then sure they are implemented in STL so you don't need to do it by yourself. No need for sorting then.
     
  17. May 15, 2008 #16

    rcgldr

    User Avatar
    Homework Helper

    Looks like map class would be the best fit. The standard doesn't specify how a map is to be implemented, just that the elements are sorted by key. Wiki states that set class is implemented as a balanced binary search tree, but doesn't mention the implementation for map, but it might be the same.

    http://en.wikipedia.org/wiki/Standard_Template_Library
     
    Last edited: May 15, 2008
Know someone interested in this topic? Share this thread via Reddit, Google+, Twitter, or Facebook

Have something to add?



Similar Discussions: Perfect Hash Table Lookup with Unknown Keys?
  1. Hash Functions (Replies: 4)

Loading...