Why isn't my hash set doing a good job?

  • Thread starter: SlurrerOfSpeech
  • Tags: Data sets, Job Set

Discussion Overview

The discussion revolves around the implementation of a hash set for strings, specifically focusing on the effectiveness of a hashing function that uses XOR operations on characters. Participants explore issues related to the distribution of strings into buckets and propose various methods to improve the hashing mechanism.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Experimental/applied

Main Points Raised

  • One participant describes their implementation of a hash set and notes that the XOR-based hashing function does not distribute strings well across buckets, even with random strings.
  • Another participant points out that lowercase ASCII characters share many bits, suggesting that the XOR method may limit effective bucket usage to around 30 buckets.
  • A different participant introduces the concept of the avalanche effect and suggests using perfect hash functions for known datasets to avoid collisions, mentioning tools like gperf.
  • One participant proposes testing the hashing function with controlled three-letter sequences to evaluate its distribution more effectively.
  • Another participant echoes the suggestion for controlled experiments, expressing concern that the current implementation does not distribute strings adequately across buckets.
  • A participant shares a code snippet and describes their approach to rehashing when the number of strings exceeds the number of buckets, indicating an attempt to improve the implementation.
  • One participant revises their bucket number calculation method but does not provide the complete new implementation, indicating ongoing experimentation.

Areas of Agreement / Disagreement

Participants express varying opinions on the effectiveness of the XOR hashing method, with some suggesting alternative approaches and others agreeing on the need for controlled experiments. The discussion remains unresolved regarding the best hashing strategy.

Contextual Notes

There are limitations in the current hashing approach, including assumptions about the input data and the potential for collisions due to the nature of the XOR operation. The effectiveness of the proposed solutions has not been fully explored or validated.

SlurrerOfSpeech
So I created my own implementation of a hash set of strings and am XORing the characters as a hashing function:

Code:
        private int _GetBucketNumber ( string s, List<List<string>> Buckets )
        {
            //       s: string whose index to look up
            // Buckets: source buckets

            // disallow empty or NULL strings
            if ( String.IsNullOrEmpty(s) ) { throw new ArgumentException("Cannot add empty or NULL string to set"); }
            if ( Buckets.Count == 0 ) { throw new ArgumentException("Tried to call _GetBucketNumber on empty bucket list"); }

            // XOR characters together and mod by length of buckets
            char c = s[0];
            for ( int i = 1; i < s.Length; ++i ) { c ^= s[i]; }
            return (int)c % Buckets.Count;
        }

Problem is that it's not doing a good job of distributing N strings into N buckets. For instance, I initialize it with 50 buckets and add in the first 50 words of http://www.math.sjsu.edu/~foster/dictionary.txt and it ends up like

[Image: distribution of the first 50 dictionary words across 50 buckets, clustered into a small range of buckets]


Now I know that's not the best example since the first 50 words (alphabetized) are very similar in characters, but I've tested it out with random strings and it does the same thing.
 
All the lowercase ASCII characters agree in many of their bits; I guess you cannot fill more than ~30 buckets with that method. All the "a"s don't make it better (note that adding two "a"s to a word doesn't change the hash at all).
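That claim can be checked by brute force. Lowercase ASCII occupies 0x61-0x7A, so every letter shares the high bits 011: XORing an even number of letters cancels those bits (result below 32), and an odd number keeps them (result in 96..127). A small sketch confirming the limited range for three-letter words:

```csharp
using System;
using System.Collections.Generic;

class XorRangeDemo
{
    static void Main()
    {
        // Collect every value the XOR of three lowercase letters can take
        var values = new HashSet<int>();
        for (char a = 'a'; a <= 'z'; ++a)
            for (char b = 'a'; b <= 'z'; ++b)
                for (char c = 'a'; c <= 'z'; ++c)
                    values.Add(a ^ b ^ c);

        // The shared high bits 011 of an odd count of lowercase letters
        // survive the XOR, so all results lie in 96..127: at most 32
        // distinct hashes, no matter how many buckets exist.
        Console.WriteLine("Distinct hash values: " + values.Count);
    }
}
```

With 50 buckets, `hash % 50` can therefore touch at most 32 of them, matching the clustering in the screenshot.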
 
Check the concept of avalanche - https://en.wikipedia.org/wiki/Avalanche_effect If your data is going to be from a known, maybe large, set of values, you can use one of the perfect hash tools - feed it all possible data and it constructs a hash with no collisions for that data set, or a subset of it. This has the advantage that you do not need code to detect and handle collisions.

gperf is a tool for that - https://www.gnu.org/software/gperf/manual/gperf.html

Examples of more general hashing algorithms: http://www.eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx
 
Why not do a more controlled experiment like three letter sequences? i.e. aaa,aab,aac, ... xxx, xxa, xxb, xxc

If your algorithm is working I'd expect aaa to be in one bucket and aab to be in the next... or some such order as that
 
Yep
jedishrfu said:
Why not do a more controlled experiment like three letter sequences? i.e. aaa,aab,aac, ... xxx, xxa, xxb, xxc

If your algorithm is working I'd expect aaa to be in one bucket and aab to be in the next... or some such order as that

They're still getting distributed into the first region of buckets and nowhere else.

Allow me to do a full code dump so that you guys can try to help me find the flaw in my logic.

Code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace StringSet
{
    class StringSet
    {

        private List<List<string>> _Buckets;
        public int NumStrings { get; private set; }

        public StringSet ( )
        {
            this._Buckets = new List<List<string>>();
            this.NumStrings = 0;
        }

        public StringSet ( string[] S )
        {
            // better way to do this?
            this._Buckets = new List<List<string>>();
            foreach ( string s in S ) this._Buckets.Add(new List<string>());
            foreach ( string s in S ) { this.Insert(s);  }
        }

        private int _GetBucketNumber ( string s, List<List<string>> Buckets )
        {
            //       s: string whose index to look up
            // Buckets: source buckets

            // disallow empty or NULL strings
            if ( String.IsNullOrEmpty(s) ) { throw new ArgumentException("Cannot add empty or NULL string to set"); }
            if ( Buckets.Count == 0 ) { throw new ArgumentException("Tried to call _GetBucketNumber on empty bucket list"); }

            // XOR characters together and mod by length of buckets
            char c = s[0];
            for ( int i = 1; i < s.Length; ++i ) { c ^= s[i]; }
            return (int)c % Buckets.Count;
        }

        private void _RehashIfNecessary ( )
        {
            // if the number of strings in the set exceeds the number of buckets,
            // double the number of buckets and redistribute the strings
            if ( this.NumStrings > this._Buckets.Count )
            {
                int NewSize = this._Buckets.Count * 2;
                List<List<string>> NewBuckets = new List<List<string>>(NewSize);
                // the List constructor argument only reserves capacity, so the
                // empty buckets must be added before they can be indexed
                for ( int i = 0; i < NewSize; ++i ) { NewBuckets.Add(new List<string>()); }
                foreach ( List<string> Bucket in this._Buckets )
                {
                    foreach ( string s in Bucket )
                    {
                        NewBuckets[this._GetBucketNumber(s, NewBuckets)].Add(s);
                    }
                }
                this._Buckets = NewBuckets;
            }
        }

        public void Insert ( string s )
        {
            // disallow empty or NULL strings
            if ( String.IsNullOrEmpty(s) ) { throw new ArgumentException("Cannot add empty or NULL string to set"); }

            // Get bucket that string belongs in
            List<string> Bucket = this._Buckets[this._GetBucketNumber(s,this._Buckets)];
            // Add if not already there; only count strings actually added
            if ( Bucket.IndexOf(s) == -1 ) { Bucket.Add(s); ++NumStrings; _RehashIfNecessary(); }
        }
      
        public bool Contains ( string s )
        {
            // returns true or false depending on whether s is a
            // string currently in the set

            return (this._Buckets[this._GetBucketNumber(s,this._Buckets)].IndexOf(s) != -1);
        }
      
        public void Print ( )
        {
            for ( int i = 0; i < this._Buckets.Count; ++i )
            {
                Console.WriteLine("Bucket {0}: {1}", i, string.Join(",",this._Buckets[i].ToArray()));
            }
        }

    }

    class Program
    {
        static void Main ( string[] args )
        {
          
            try
            {
                List<string> Words = new List<string>();
                for (char i = 'a'; i <= 'z'; ++i)
                    for (char j = 'a'; j <= 'z'; ++j)
                        for (char k = 'a'; k <= 'z'; ++k )
                            Words.Add(new string(new char[] { i, j, k }));
                /*
                System.IO.StreamReader file = new System.IO.StreamReader(System.AppDomain.CurrentDomain.BaseDirectory + "testfiles\\dictionary.txt");
                string line;
                while ( (line = file.ReadLine()) != null )
                {
                    Words.Add(line);
                }*/
                StringSet TestSet = new StringSet(Words.ToArray());
                Console.WriteLine("Number of strings: {0}", TestSet.NumStrings);
                TestSet.Print();
            }
            catch ( Exception E )
            {
                Console.WriteLine("Exception occurred: {0}", E.Message);
            }
            Console.Read(); // just to keep console open
        }
    }
}
 
Ok, with the knowledge that the possible input strings are 'aaa', 'aab', ..., 'zzz', I changed my function to

Code:
        private int _GetBucketNumber ( string s, List<List<string>> Buckets )
        {
            //       s: string whose index to look up
            // Buckets: source buckets

            // disallow empty or NULL strings
            if ( String.IsNullOrEmpty(s) ) { throw new ArgumentException("Cannot add empty or NULL string to set"); }
            if ( Buckets.Count == 0 ) { throw new ArgumentException("Tried to call _GetBucketNumber on empty bucket list"); }

            // XOR characters together and mod by length of buckets
            /*
            char c = s[0];
            for ( int i = 1; i < s.Length; ++i ) { c ^= s[i]; }
            return (int)c % Buckets.Count;*/
            return (676 * (int)('z' - s[0]) + 26 * (int)('z' - s[1]) + 1 * (int)('z' - s[2])) % Buckets.Count;
        }

and of course that worked perfectly. But I want mine to be general enough to work on arbitrary strings.
 
Well, XOR of ASCII will not give a good hash function, no matter how exactly you implement it.
You can generalize your 3-letter approach to arbitrary strings. It is still not a good hash function, but it is better than XOR.

What is the purpose of that hash function? Why don't you use an established hash?
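The generalization suggested above is a positional (polynomial) hash: treat the string as digits of a number in some base, so character order and repeats both matter. A minimal sketch; the base 31 and the function name are illustrative choices, not from the thread:

```csharp
using System;

class PolyHashDemo
{
    // Polynomial hash: h = ((c0 * 31 + c1) * 31 + c2) * 31 + ...
    // Generalizes the 676/26/1 weighting above to any string length.
    static int GetBucketNumber(string s, int bucketCount)
    {
        unchecked
        {
            int h = 0;
            foreach (char c in s) { h = h * 31 + c; }
            // Mask off the sign bit so the modulus is non-negative
            return (h & 0x7FFFFFFF) % bucketCount;
        }
    }

    static void Main()
    {
        // Unlike the XOR fold, appending a repeated pair of characters
        // changes the hash, so "ab" and "abaa" land in different buckets here
        Console.WriteLine(GetBucketNumber("ab", 50));    // prints 5
        Console.WriteLine(GetBucketNumber("abaa", 50));  // prints 9
    }
}
```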
 
How can an algorithm with such predictable and localized bit effects ever be a good hash function? The whole idea of the hash function is to make inputs with very similar bits map to a wide spread of values. Changing one input bit anywhere should change a lot of output bits in an "unpredictable" way.
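The failure of that criterion can be seen directly on the thread's XOR hash: flipping a single input bit flips exactly the same single output bit, which is the opposite of avalanche. A minimal sketch:

```csharp
using System;

class XorAvalancheDemo
{
    // The thread's XOR-fold hash, reduced to its core
    static int XorHash(string s)
    {
        int h = 0;
        foreach (char c in s) { h ^= c; }
        return h;
    }

    static void Main()
    {
        // 'a' (0x61) and 'c' (0x63) differ in one bit; the two hashes
        // differ in exactly that one bit as well
        int diff = XorHash("aaa") ^ XorHash("aac");
        Console.WriteLine(Convert.ToString(diff, 2)); // prints "10"
    }
}
```

A hash with good avalanche behavior would flip roughly half of the output bits for the same one-bit input change.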
 
  • #10
I am missing something.

It seems you are having trouble with underlying concepts, in which case we supplied good links.
Otherwise, if this is for a job, why not use known good algorithms? The FNV hash comes to mind.

Not doing either of these is getting you nowhere. Why not read?
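For reference, a minimal sketch of the 32-bit FNV-1a hash named above; the offset basis and prime are the published FNV constants, and treating each char as one byte is an assumption that holds for ASCII input:

```csharp
using System;

class FnvDemo
{
    // 32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime
    // so the change propagates into the high bits.
    static uint Fnv1a(string s)
    {
        unchecked
        {
            uint h = 2166136261;   // FNV-1a 32-bit offset basis
            foreach (char c in s)
            {
                h ^= c;            // assumes ASCII; wider chars need byte-wise handling
                h *= 16777619;     // FNV 32-bit prime
            }
            return h;
        }
    }

    static void Main()
    {
        // Anagrams like "ab" / "ba" collide under the XOR fold,
        // but get distinct FNV-1a hashes (and so distinct buckets)
        Console.WriteLine(Fnv1a("ab") % 50);
        Console.WriteLine(Fnv1a("ba") % 50);
    }
}
```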
 
  • #11
FactChecker said:
How can an algorithm with such predictable and localized bit effects ever be a good hash function? The whole idea of the hash function is to make inputs with very similar bits map to a wide spread of values. Changing one input bit anywhere should change a lot of output bits in an "unpredictable" way.

Hmmm, I always thought that XOR was considered the best operator for hashing functions, because with AND or OR the successive operations drift toward all-zero or all-one bits.
jim mcnamara said:
Check the concept of avalanche - https://en.wikipedia.org/wiki/Avalanche_effect If your data is going to be from a known, maybe large, set of values, you can use one of the perfect hash tools - feed it all possible data and it constructs a hash with no collisions for that data set, or a subset of it. This has the advantage that you do not need code to detect and handle collisions.

gperf is a tool for that - https://www.gnu.org/software/gperf/manual/gperf.html

Examples of more general hashing algorithms: http://www.eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx

That eternallyconfuzzled article is really good. I implemented the Bernstein hash and it worked out well:

[Image: bucket distribution after switching to the Bernstein hash, spread across all buckets]

Now I'm going to run a performance test and I'll post back results.
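For readers following along, a minimal sketch of the Bernstein (djb2) hash referred to above, in its common `hash * 33 + c` formulation with seed 5381:

```csharp
using System;

class Djb2Demo
{
    // Bernstein (djb2) hash: h = h * 33 + c, seeded with 5381.
    // The repeated multiply spreads each character across many bits.
    static uint Djb2(string s)
    {
        unchecked
        {
            uint h = 5381;
            foreach (char c in s) { h = h * 33 + c; }
            return h;
        }
    }

    static void Main()
    {
        // Anagrams that the XOR fold sends to the same bucket now separate
        Console.WriteLine(Djb2("ab") % 50);  // prints 8
        Console.WriteLine(Djb2("ba") % 50);  // prints 40
    }
}
```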
 
  • #12
Alright, check this out:

[Image: performance test output comparing average lookup ticks for List and the hash set]


That is the result of my performance test

Code:
        // note: Stopwatch requires "using System.Diagnostics;"
        public void RunPerformanceTest ( )
        {
            // tests string hash set against regular List<string> 
            // in terms of lookup

            // create list of all elements in hash set
            List<string> ListVersion = new List<string>(),
                                Copy = new List<string>();
            foreach ( List<string> Bucket in this._Buckets )
            {
                foreach ( string Str in Bucket )
                {
                    ListVersion.Add(Str);
                    Copy.Add(Str);
                }                    
            }

            // calculate average time to look up all elements
            Stopwatch Watch = new Stopwatch();
            long tList = 0, tHset = 0; // ms
            foreach ( string Str in Copy )
            {
                // measure time to look up string in ordinary list
                Watch.Start();
                if ( ListVersion.Contains(Str) ) { }
                Watch.Stop();
                tList += Watch.ElapsedTicks;
                // now measure time to look up same string in my hash set
                Watch.Reset();
                Watch.Start();
                if ( this.Contains(Str) ) { }
                Watch.Stop();
                tHset += Watch.ElapsedTicks;
                Watch.Reset();
            }
            int n = Copy.Count;
            Console.WriteLine("Average ticks to look up in List: {0}", tList / n);
            Console.WriteLine("Average ticks to look up in hashset: {0}", tHset / n);
        }

Thoughts?
 
  • #13
That is very interesting from a learning point of view.

For professional programming, you should use the tools that are already in the language. Many languages (C++, Perl, Python, etc.) already have a hash (a.k.a. associative array, map) implemented. They work well, take care of collisions, allocate memory, etc. C++'s std::unordered_map is a hash implementation. The implementations in the language will probably be optimized in ways that are hard to compete with.
 
  • #14
SlurrerOfSpeech said:
Hmmm, I always thought that XOR was considered the best operator for hashing functions, because with AND or OR the successive operations drift toward all-zero or all-one bits.
XOR is the best approach if you have to treat everything bit by bit. That is an arbitrary constraint that rules out all good hash functions.
FactChecker said:
Many languages (C++, Perl, Python, etc.) already have a hash (a.k.a. associative array, map) implemented.
And in general, you won't beat those implementations until you really know what you are doing.
 
  • #16
I ran the same test with an English dictionary of ~350,000 words. The test took over an hour to run and here are the results:

[Image: benchmark results for the ~350,000-word dictionary test]


That's a 1460x difference in performance. Wow.
 
  • #17
What are you looking up, and how? A binary search in a sorted list should not take more than 19 accesses with 350,000 words, and it avoids any hash collisions, so I don't see where the factor of 1460 comes from.
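The 19-access figure is just ceil(log2(350,000)). In C#, `List<T>.BinarySearch` provides exactly that lookup on a sorted list. A small sketch (`Math.Log2` assumes .NET Core 3.0 or later; the word list is illustrative):

```csharp
using System;
using System.Collections.Generic;

class BinarySearchDemo
{
    static void Main()
    {
        // 350,000 words fit within 2^19, so at most 19 probes
        Console.WriteLine(Math.Ceiling(Math.Log2(350000)));  // prints 19

        // List<T>.BinarySearch gives O(log n) lookup with no hashing,
        // provided the list stays sorted with the same comparer
        var words = new List<string> { "cherry", "apple", "banana" };
        words.Sort(StringComparer.Ordinal);
        Console.WriteLine(words.BinarySearch("banana", StringComparer.Ordinal) >= 0); // prints True
    }
}
```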
 
  • #18
mfb said:
What are you looking up, and how? A binary search in a sorted list should not take more than 19 accesses with 350,000 words, and it avoids any hash collisions, so I don't see where the factor of 1460 comes from.

That is my computed difference in performance of lookup between having the 350k words in an unordered List<string> versus the StringSet class that I created.
 
  • #19
Ah, unordered list. Then it is not surprising.
 
  • #20
A CRC is sometimes used as a hash function, but the number of buckets needs to be a power of 2 to correspond with the CRC.
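One practical payoff of the power-of-two requirement: the modulus reduces to a single bit mask, which assumes the hash's low bits are well mixed (as they are for a CRC). A small sketch of the equivalence:

```csharp
using System;

class MaskDemo
{
    static void Main()
    {
        // For a power-of-two bucket count, hash % size == hash & (size - 1)
        // for any unsigned hash, so no division is needed per lookup.
        uint size = 64;               // must be a power of two
        uint hash = 0xDEADBEEF;       // stand-in for any hash value, e.g. a CRC
        Console.WriteLine(hash % size == (hash & (size - 1))); // prints True
    }
}
```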
 
