C/++/# Why isn't my hash set doing a good job?

  1. Apr 8, 2016 #1
    So I created my own implementation of a hash set of strings and am XORing the characters as a hashing function:

    Code (Text):

            private int _GetBucketNumber ( string s, List<List<string>> Buckets )
            {
                //       s: string whose index to look up
                // Buckets: source buckets

                // disallow empty or NULL strings
                if ( String.IsNullOrEmpty(s) ) { throw new ArgumentException("Cannot add empty or NULL string to set"); }
                if ( Buckets.Count == 0 ) { throw new ArgumentException("Tried to call _GetBucketNumber on empty bucket list"); }

                // XOR characters together and mod by length of buckets
                char c = s[0];
                for ( int i = 1; i < s.Length; ++i ) { c ^= s[i]; }
                return (int)c % Buckets.Count;
            }
     
    The problem is that it's not doing a good job of distributing N strings into N buckets. For instance, I initialize it with 50 buckets, add in the first 50 words of http://www.math.sjsu.edu/~foster/dictionary.txt, and it ends up like this:

    [Screenshot: the 50 words pile up in a small range of buckets, leaving most of the 50 buckets empty]

    Now I know that's not the best example since the first 50 words (alphabetized) are very similar in characters, but I've tested it out with random strings and it does the same thing.
     
  3. Apr 8, 2016 #2

    mfb


    Staff: Mentor

    All the lowercase ASCII characters agree in many of their bits; I guess you cannot fill more than ~30 buckets with that method. All the "a"s don't make it better (note that adding two "a"s to a word doesn't change the hash at all, since they cancel under XOR).
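    A quick way to check that ceiling for three-letter lowercase words (a small standalone sketch, assuming plain ASCII input):

    Code (Text):

        using System;
        using System.Collections.Generic;

        class XorCeilingDemo
        {
            static void Main ( )
            {
                // how many distinct values can "XOR all the characters together"
                // produce for three-letter lowercase words?
                HashSet<int> seen = new HashSet<int>();
                for (char i = 'a'; i <= 'z'; ++i)
                    for (char j = 'a'; j <= 'z'; ++j)
                        for (char k = 'a'; k <= 'z'; ++k)
                            seen.Add(i ^ j ^ k);

                // bits 5 and 6 are set in every lowercase ASCII letter, so they come out
                // the same for every word and only the low 5 bits vary: the XOR always
                // lands in 0x60..0x7F, i.e. at most 32 distinct values before the modulus
                Console.WriteLine(seen.Count);
            }
        }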
     
  4. Apr 8, 2016 #3

    jim mcnamara


    Staff: Mentor

    Check the concept of avalanche - https://en.wikipedia.org/wiki/Avalanche_effect If your data is going to come from a known, maybe large, set of values, you can use one of the perfect hash tools - feed it all possible data and it constructs a hash with no collisions for that data set (or a subset of it). This has the advantage that you do not need code to detect and handle collisions.

    gperf is a tool for that - https://www.gnu.org/software/gperf/manual/gperf.html

    Examples of more general hashing algorithms: http://www.eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx
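    To make the avalanche point concrete with the XOR hash from post #1 (a small standalone demo, not code from this thread):

    Code (Text):

        using System;

        class AvalancheDemo
        {
            // the xor-of-characters hash from the first post
            static int XorHash ( string s )
            {
                char c = s[0];
                for ( int i = 1; i < s.Length; ++i ) { c ^= s[i]; }
                return (int)c;
            }

            static void Main ( )
            {
                // flipping one input bit ("cat" vs "cau": 't' and 'u' differ only in bit 0)
                // flips exactly one output bit -- the opposite of an avalanche
                Console.WriteLine(Convert.ToString(XorHash("cat") ^ XorHash("cau"), 2)); // prints "1"
            }
        }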
     
  5. Apr 8, 2016 #4

    jedishrfu

    Staff: Mentor

    Why not do a more controlled experiment, like three-letter sequences? I.e. aaa, aab, aac, ... xxx, xxa, xxb, xxc

    If your algorithm is working, I'd expect aaa to be in one bucket and aab to be in the next... or some such order as that.
     
  6. Apr 9, 2016 #5
    Yep, they're still getting distributed into the first region of buckets and nowhere else.

    Allow me to do a full code dump so that you guys can try to help me find the flaw in my logic.

    Code (Text):

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;

    namespace StringSet
    {


        class StringSet
        {

            private List<List<string>> _Buckets;
            public int NumStrings { get; private set; }

            public StringSet ( )
            {
                // start with a single empty bucket so Insert works on a
                // default-constructed set; _RehashIfNecessary grows it later
                this._Buckets = new List<List<string>>() { new List<string>() };
                this.NumStrings = 0;
            }

            public StringSet ( string[] S )
            {
                // better way to do this?
                this._Buckets = new List<List<string>>();
                foreach ( string s in S ) this._Buckets.Add(new List<string>());
                foreach ( string s in S ) { this.Insert(s);  }
            }

            private int _GetBucketNumber ( string s, List<List<string>> Buckets )
            {
                //       s: string whose index to look up
                // Buckets: source buckets

                // disallow empty or NULL strings
                if ( String.IsNullOrEmpty(s) ) { throw new ArgumentException("Cannot add empty or NULL string to set"); }
                if ( Buckets.Count == 0 ) { throw new ArgumentException("Tried to call _GetBucketNumber on empty bucket list"); }

                // XOR characters together and mod by length of buckets
                char c = s[0];
                for ( int i = 1; i < s.Length; ++i ) { c ^= s[i]; }
                return (int)c % Buckets.Count;
            }

            private void _RehashIfNecessary ( )
            {
                // if the number of strings in the set exceeds the number of buckets,
                // increase the number of buckets to either double its current size
                // or the largest number of buckets possible, whichever is smaller
                if ( this.NumStrings > this._Buckets.Count )
                {
                    int NewCount = (int)Math.Min((long)this._Buckets.Count * 2, Int32.MaxValue);
                    // the List constructor argument is only a capacity, so the new
                    // (empty) buckets have to be added before they can be indexed
                    List<List<string>> NewBuckets = new List<List<string>>(NewCount);
                    for ( int i = 0; i < NewCount; ++i ) { NewBuckets.Add(new List<string>()); }
                    foreach ( List<string> Bucket in this._Buckets )
                    {
                        foreach ( string s in Bucket )
                        {
                            NewBuckets[this._GetBucketNumber(s, NewBuckets)].Add(s);
                        }
                    }
                    this._Buckets = NewBuckets;
                }
            }

            public void Insert ( string s )
            {
                // disallow empty or NULL strings
                if ( String.IsNullOrEmpty(s) ) { throw new ArgumentException("Cannot add empty or NULL string to set"); }

                // Get bucket that string belongs in
                List<string> Bucket = this._Buckets[this._GetBucketNumber(s,this._Buckets)];
                // Add if not already there; only count strings that were actually added
                if ( Bucket.IndexOf(s) == -1 ) { Bucket.Add(s); ++NumStrings; _RehashIfNecessary(); }
            }
         
            public bool Contains ( string s )
            {
                // returns true or false depending on whether s is a
                // string currently in the set

                return (this._Buckets[this._GetBucketNumber(s,this._Buckets)].IndexOf(s) != -1);
            }
         
            public void Print ( )
            {
                for ( int i = 0; i < this._Buckets.Count; ++i )
                {
                    Console.WriteLine("Bucket {0}: {1}", i, string.Join(",",this._Buckets[i].ToArray()));
                }
            }

        }

        class Program
        {
            static void Main ( string[] args )
            {
             
                try
                {
                    List<string> Words = new List<string>();
                    for (char i = 'a'; i <= 'z'; ++i)
                        for (char j = 'a'; j <= 'z'; ++j)
                            for (char k = 'a'; k <= 'z'; ++k )
                                Words.Add(new string(new char[] { i, j, k }));
                    /*
                    System.IO.StreamReader file = new System.IO.StreamReader(System.AppDomain.CurrentDomain.BaseDirectory + "testfiles\\dictionary.txt");
                    string line;
                    while ( (line = file.ReadLine()) != null )
                    {
                        Words.Add(line);
                    }*/
                    StringSet TestSet = new StringSet(Words.ToArray());
                    Console.WriteLine("Number of strings: {0}", TestSet.NumStrings);
                    TestSet.Print();
                }
                catch ( Exception E )
                {
                    Console.WriteLine("Exception occured: {0}", E.Message);
                }
                Console.Read(); // just to keep console open
            }
        }
    }
     
     
  7. Apr 9, 2016 #6
    Ok, with the knowledge that the range of possible values of inputted strings is 'aaa', 'aab', ..., 'zzz', I changed my function to

    Code (Text):

            private int _GetBucketNumber ( string s, List<List<string>> Buckets )
            {
                //       s: string whose index to look up
                // Buckets: source buckets

                // disallow empty or NULL strings
                if ( String.IsNullOrEmpty(s) ) { throw new ArgumentException("Cannot add empty or NULL string to set"); }
                if ( Buckets.Count == 0 ) { throw new ArgumentException("Tried to call _GetBucketNumber on empty bucket list"); }

                // XOR characters together and mod by length of buckets
                /*
                char c = s[0];
                for ( int i = 1; i < s.Length; ++i ) { c ^= s[i]; }
                return (int)c % Buckets.Count;*/
                return (676 * (int)('z' - s[0]) + 26 * (int)('z' - s[1]) + 1 * (int)('z' - s[2])) % Buckets.Count;
            }
     
    and of course that worked perfectly. But I want mine to be general enough to work on arbitrary strings.
     
  8. Apr 9, 2016 #7

    mfb


    Staff: Mentor

    Well, xor of ascii will not give a good hash function, no matter how exactly you implement it.
    You can generalize your 3-letter approach to arbitrary strings. It is still not a good hash function, but better than xor.
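    For instance, that generalization might look something like this (a sketch only, treating each character as a base-26 "digit" the way the 3-letter formula does, with the running value allowed to wrap):

    Code (Text):

        private int _GetBucketNumber ( string s, List<List<string>> Buckets )
        {
            if ( String.IsNullOrEmpty(s) ) { throw new ArgumentException("Cannot add empty or NULL string to set"); }
            if ( Buckets.Count == 0 ) { throw new ArgumentException("Tried to call _GetBucketNumber on empty bucket list"); }

            unchecked
            {
                uint h = 0;
                // same positional idea as the 3-letter version, but for any length:
                // h = 26^(n-1)*('z'-s[0]) + ... + 26*('z'-s[n-2]) + ('z'-s[n-1]), wrapped to 32 bits
                foreach ( char c in s ) { h = h * 26 + (uint)('z' - c); }
                return (int)(h % (uint)Buckets.Count);
            }
        }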

    What is the purpose of that hash function? Why don't you use an established hash?
     
  9. Apr 9, 2016 #8

    FactChecker

    Science Advisor
    Gold Member

    How can an algorithm with such predictable and localized bit effects ever be a good hash function? The whole idea of the hash function is to make inputs with very similar bits map to a wide spread of values. Changing one input bit anywhere should change a lot of output bits in an "unpredictable" way.
     
  11. Apr 9, 2016 #10

    jim mcnamara


    Staff: Mentor

    I am missing something.

    It seems you are having trouble with underlying concepts, in which case we supplied good links.
    Otherwise, if this is for a job, why not use known good algorithms? The FNV hash comes to mind.
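    For reference, FNV-1a is only a few lines; a sketch in C# (applied here to the UTF-16 characters rather than raw bytes; the constants are the standard 32-bit FNV offset basis and prime):

    Code (Text):

        // 32-bit FNV-1a over the characters of a string
        static uint Fnv1a ( string s )
        {
            unchecked
            {
                uint hash = 2166136261;
                foreach ( char c in s )
                {
                    hash ^= c;          // fold in the next character
                    hash *= 16777619;   // multiply by the FNV prime
                }
                return hash;
            }
        }

        // the bucket index would then be (int)(Fnv1a(s) % (uint)Buckets.Count)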

    Not doing either of these is getting you nowhere. Why not read?
     
  12. Apr 9, 2016 #11
    Hmmm, I always thought that XOR was considered the best operator for hashing functions, because with AND or OR the successive operations bias the bits toward all ones or all zeros.

    That eternallyconfuzzled article is really good. I implemented the Bernstein hash and it worked out well:

    [Screenshot: bucket contents after switching to the Bernstein hash -- the strings are spread across all of the buckets]
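    Roughly what that looks like dropped into _GetBucketNumber (a sketch of the Bernstein/djb2 idea, not necessarily the exact code behind the screenshot):

    Code (Text):

        private int _GetBucketNumber ( string s, List<List<string>> Buckets )
        {
            if ( String.IsNullOrEmpty(s) ) { throw new ArgumentException("Cannot add empty or NULL string to set"); }
            if ( Buckets.Count == 0 ) { throw new ArgumentException("Tried to call _GetBucketNumber on empty bucket list"); }

            unchecked
            {
                // Bernstein (djb2): start at 5381 and fold each character in
                // with a multiply-by-33 and an add, letting the value wrap
                uint h = 5381;
                foreach ( char c in s ) { h = h * 33 + c; }
                return (int)(h % (uint)Buckets.Count);
            }
        }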


    I'm going to run a performance test and I'll post back results.
     
  13. Apr 10, 2016 #12
    Alright, check this out:

    [Screenshot: performance test output -- average lookup ticks for the List<string> versus the hash set]

    That is the result of my performance test; here is the code for it:

    Code (Text):

            public void RunPerformanceTest ( )
            {
                // tests string hash set against regular List<string>
                // in terms of lookup time
                // (requires "using System.Diagnostics;" for Stopwatch)

                // create list of all elements in hash set
                List<string> ListVersion = new List<string>(),
                                    Copy = new List<string>();
                foreach ( List<string> Bucket in this._Buckets )
                {
                    foreach ( string Str in Bucket )
                    {
                        ListVersion.Add(Str);
                        Copy.Add(Str);
                    }                    
                }

                // calculate average time to look up all elements
                Stopwatch Watch = new Stopwatch();
                long tList = 0, tHset = 0; // elapsed ticks
                foreach ( string Str in Copy )
                {
                    // measure time to look up string in ordinary list
                    Watch.Start();
                    if ( ListVersion.Contains(Str) ) { }
                    Watch.Stop();
                    tList += Watch.ElapsedTicks;
                    // now measure time to look up same string in my hash set
                    Watch.Reset();
                    Watch.Start();
                    if ( this.Contains(Str) ) { }
                    Watch.Stop();
                    tHset += Watch.ElapsedTicks;
                    Watch.Reset();
                }
                int n = Copy.Count;
                Console.WriteLine("Average ticks to look up in List: {0}", tList / n);
                Console.WriteLine("Average ticks to look up in hashset: {0}", tHset / n);
            }
     
    Thoughts?
     
  14. Apr 10, 2016 #13

    FactChecker

    Science Advisor
    Gold Member

    That is very interesting from a learning point of view.

    For professional programming, you should use the tools that are already in the language. Many languages (C++, Perl, Python, etc.) already have a hash (a.k.a. associative array, map) implemented. They work well, take care of collisions, allocate memory, etc. In C++ that is std::unordered_map (std::map is tree-based, not a hash). The implementations in the language will probably be optimized in ways that are hard to compete with.
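    In C# specifically, the framework already ships one; a minimal example with the built-in System.Collections.Generic.HashSet<string>, which handles hashing, collisions, and resizing:

    Code (Text):

        using System;
        using System.Collections.Generic;

        class BuiltInSetDemo
        {
            static void Main ( )
            {
                HashSet<string> set = new HashSet<string>();
                set.Add("aardvark");
                set.Add("abacus");
                set.Add("abacus");                         // duplicates are ignored

                Console.WriteLine(set.Count);              // 2
                Console.WriteLine(set.Contains("abacus")); // True
                Console.WriteLine(set.Contains("zebra"));  // False
            }
        }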
     
  15. Apr 10, 2016 #14

    mfb


    Staff: Mentor

    XOR is the best approach if you have to treat everything bit by bit, but that is an arbitrary constraint that rules out all good hash functions.
    And in general, you won't beat those built-in implementations until you really know what you are doing.
     
  17. Apr 10, 2016 #16
    I ran the same test with an English dictionary of ~350,000 words. The test took over an hour to run and here are the results:

    [Screenshot: lookup timings for the ~350,000-word dictionary -- the hash set averages about 1460x fewer ticks per lookup than the List<string>]

    That's a 1460x difference in performance. Wow.
     
  18. Apr 11, 2016 #17

    mfb


    Staff: Mentor

    What are you looking up, and how? A binary search in a sorted list should not take more than 19 accesses with 350,000 words (2^19 > 350,000), and it avoids any hash collisions, so I don't see where the factor of 1460 comes from.
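    For comparison, that sorted-list lookup is only a couple of calls in C# (a sketch; "words" stands in for the dictionary list):

    Code (Text):

        using System;
        using System.Collections.Generic;

        class SortedLookupDemo
        {
            static void Main ( )
            {
                // "words" stands in for the 350,000-word list: sort once,
                // then every lookup costs O(log n) comparisons
                List<string> words = new List<string> { "cherry", "apple", "banana" };
                words.Sort(StringComparer.Ordinal);

                bool found = words.BinarySearch("banana", StringComparer.Ordinal) >= 0;
                Console.WriteLine(found); // True
            }
        }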
     
  19. Apr 11, 2016 #18
    That is the difference in lookup performance I measured between having the 350k words in an unordered List<string> versus the StringSet class that I created.
     
  20. Apr 11, 2016 #19

    mfb


    Staff: Mentor

    Ah, unordered list. Then it is not surprising.
     
  21. Apr 11, 2016 #20

    rcgldr

    Homework Helper

    A CRC is sometimes used as a hash function, but the number of buckets needs to be a power of 2 to correspond with the CRC.
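    For illustration, a bit-by-bit CRC-32 sketch (standard reflected polynomial 0xEDB88320); with a power-of-two bucket count the modulus becomes a simple mask of the low bits:

    Code (Text):

        using System;

        class CrcBucketDemo
        {
            // bitwise CRC-32 over the low byte of each character
            // (a sketch, not an optimized table-driven version)
            static uint Crc32 ( string s )
            {
                uint crc = 0xFFFFFFFF;
                foreach ( char ch in s )
                {
                    crc ^= (uint)(ch & 0xFF);
                    for ( int i = 0; i < 8; ++i )
                        crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320 : crc >> 1;
                }
                return ~crc;
            }

            static void Main ( )
            {
                int bucketCount = 64;                                            // power of two
                int bucket = (int)(Crc32("aardvark") & (uint)(bucketCount - 1)); // mod is just a mask
                Console.WriteLine(bucket);
            }
        }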
     