Why isn't my hash set doing a good job?

  • Thread starter: SlurrerOfSpeech
  • Tags: Data sets, Job Set

Discussion Overview

The discussion revolves around the implementation of a hash set for strings, specifically focusing on the effectiveness of a hashing function that uses XOR operations on characters. Participants explore issues related to the distribution of strings into buckets and propose various methods to improve the hashing mechanism.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Experimental/applied

Main Points Raised

  • One participant describes their implementation of a hash set and notes that the XOR-based hashing function does not distribute strings well across buckets, even with random strings.
  • Another participant points out that lowercase ASCII characters share many bits, suggesting that the XOR method may limit effective bucket usage to around 30 buckets.
  • A different participant introduces the concept of the avalanche effect and suggests using perfect hash functions for known datasets to avoid collisions, mentioning tools like gperf.
  • One participant proposes testing the hashing function with controlled three-letter sequences to evaluate its distribution more effectively.
  • Another participant echoes the suggestion for controlled experiments, expressing concern that the current implementation does not distribute strings adequately across buckets.
  • A participant shares a code snippet and describes their approach to rehashing when the number of strings exceeds the number of buckets, indicating an attempt to improve the implementation.
  • One participant revises their bucket number calculation method but does not provide the complete new implementation, indicating ongoing experimentation.

Areas of Agreement / Disagreement

Participants express varying opinions on the effectiveness of the XOR hashing method, with some suggesting alternative approaches and others agreeing on the need for controlled experiments. The discussion remains unresolved regarding the best hashing strategy.

Contextual Notes

There are limitations in the current hashing approach, including assumptions about the input data and the potential for collisions due to the nature of the XOR operation. The effectiveness of the proposed solutions has not been fully explored or validated.

SlurrerOfSpeech
So I created my own implementation of a hash set of strings and am XORing the characters as a hashing function:

Code:
        private int _GetBucketNumber ( string s, List<List<string>> Buckets )
        {
            //       s: string whose index to look up
            // Buckets: source buckets

            // disallow empty or NULL strings
            if ( String.IsNullOrEmpty(s) ) { throw new ArgumentException("Cannot add empty or NULL string to set"); }
            if ( Buckets.Count == 0 ) { throw new ArgumentException("Tried to call _GetBucketNumber on empty bucket list"); }

            // XOR characters together and mod by length of buckets
            char c = s[0];
            for ( int i = 1; i < s.Length; ++i ) { c ^= s[i]; }
            return (int)c % Buckets.Count;
        }

Problem is that it's not doing a good job of distributing N strings into N buckets. For instance, I initialize it with 50 buckets and add in the first 50 words of http://www.math.sjsu.edu/~foster/dictionary.txt and it ends up like

[Image: distribution of the first 50 dictionary words across 50 buckets, clustered into a small range of buckets]


Now I know that's not the best example since the first 50 words (alphabetized) are very similar in characters, but I've tested it out with random strings and it does the same thing.
 
All the lowercase ASCII characters agree in many of their bits; I guess you cannot fill more than ~30 buckets with that method. All the "a"s don't make it better (note that adding two "a"s to a word doesn't change the hash at all).
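That claim can be checked by brute force. Lowercase ASCII occupies 0x61-0x7A, so every letter shares the high bits 011: XORing an even number of letters cancels those bits (result below 32), and an odd number keeps them (result in 96..127). A small sketch confirming the limited range for three-letter words:

```csharp
using System;
using System.Collections.Generic;

class XorRangeDemo
{
    static void Main()
    {
        // Collect every value the XOR of three lowercase letters can take
        var values = new HashSet<int>();
        for (char a = 'a'; a <= 'z'; ++a)
            for (char b = 'a'; b <= 'z'; ++b)
                for (char c = 'a'; c <= 'z'; ++c)
                    values.Add(a ^ b ^ c);

        // The shared high bits 011 of an odd count of lowercase letters
        // survive the XOR, so all results lie in 96..127: at most 32
        // distinct hashes, no matter how many buckets exist.
        Console.WriteLine("Distinct hash values: " + values.Count);
    }
}
```

With 50 buckets, `hash % 50` can therefore touch at most 32 of them, matching the clustering in the screenshot.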
 
Check the concept of avalanche - https://en.wikipedia.org/wiki/Avalanche_effect If your data is going to be from a known, maybe large, set of values, you can use one of the perfect hash tools - feed it all possible data and it constructs a hash with no collisions for that data set, or a subset of it. This has the advantage that you do not need code to detect and handle collisions.

gperf is a tool for that - https://www.gnu.org/software/gperf/manual/gperf.html

Examples of more general hashing algorithms: http://www.eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx
 
Why not do a more controlled experiment like three letter sequences? i.e. aaa,aab,aac, ... xxx, xxa, xxb, xxc

If your algorithm is working I'd expect aaa to be in one bucket and aab to be in the next... or some such order as that
 
Yep
jedishrfu said:
Why not do a more controlled experiment like three letter sequences? i.e. aaa,aab,aac, ... xxx, xxa, xxb, xxc

If your algorithm is working I'd expect aaa to be in one bucket and aab to be in the next... or some such order as that

They're still getting distributed into the first region of buckets and nowhere else.

Allow me to do a full code dump so that you guys can try to help me find the flaw in my logic.

Code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace StringSet
{
    class StringSet
    {

        private List<List<string>> _Buckets;
        public int NumStrings { get; private set; }

        public StringSet ( )
        {
            this._Buckets = new List<List<string>>();
            this.NumStrings = 0;
        }

        public StringSet ( string[] S )
        {
            // better way to do this?
            this._Buckets = new List<List<string>>();
            foreach ( string s in S ) this._Buckets.Add(new List<string>());
            foreach ( string s in S ) { this.Insert(s);  }
        }

        private int _GetBucketNumber ( string s, List<List<string>> Buckets )
        {
            //       s: string whose index to look up
            // Buckets: source buckets

            // disallow empty or NULL strings
            if ( String.IsNullOrEmpty(s) ) { throw new ArgumentException("Cannot add empty or NULL string to set"); }
            if ( Buckets.Count == 0 ) { throw new ArgumentException("Tried to call _GetBucketNumber on empty bucket list"); }

            // XOR characters together and mod by length of buckets
            char c = s[0];
            for ( int i = 1; i < s.Length; ++i ) { c ^= s[i]; }
            return (int)c % Buckets.Count;
        }

        private void _RehashIfNecessary ( )
        {
            // if the number of strings in the set exceeds the number of buckets,
            // double the number of buckets and redistribute the strings
            if ( this.NumStrings > this._Buckets.Count )
            {
                int NewSize = this._Buckets.Count * 2;
                List<List<string>> NewBuckets = new List<List<string>>(NewSize);
                // the List constructor argument only reserves capacity, so the
                // empty buckets must be added before they can be indexed
                for ( int i = 0; i < NewSize; ++i ) { NewBuckets.Add(new List<string>()); }
                foreach ( List<string> Bucket in this._Buckets )
                {
                    foreach ( string s in Bucket )
                    {
                        NewBuckets[this._GetBucketNumber(s, NewBuckets)].Add(s);
                    }
                }
                this._Buckets = NewBuckets;
            }
        }

        public void Insert ( string s )
        {
            // disallow empty or NULL strings
            if ( String.IsNullOrEmpty(s) ) { throw new ArgumentException("Cannot add empty or NULL string to set"); }

            // Get bucket that string belongs in
            List<string> Bucket = this._Buckets[this._GetBucketNumber(s,this._Buckets)];
            // Add if not already there; only count strings actually added
            if ( Bucket.IndexOf(s) == -1 ) { Bucket.Add(s); ++NumStrings; _RehashIfNecessary(); }
        }
      
        public bool Contains ( string s )
        {
            // returns true or false depending on whether s is a
            // string currently in the set

            return (this._Buckets[this._GetBucketNumber(s,this._Buckets)].IndexOf(s) != -1);
        }
      
        public void Print ( )
        {
            for ( int i = 0; i < this._Buckets.Count; ++i )
            {
                Console.WriteLine("Bucket {0}: {1}", i, string.Join(",",this._Buckets[i].ToArray()));
            }
        }

    }

    class Program
    {
        static void Main ( string[] args )
        {
          
            try
            {
                List<string> Words = new List<string>();
                for (char i = 'a'; i <= 'z'; ++i)
                    for (char j = 'a'; j <= 'z'; ++j)
                        for (char k = 'a'; k <= 'z'; ++k )
                            Words.Add(new string(new char[] { i, j, k }));
                /*
                System.IO.StreamReader file = new System.IO.StreamReader(System.AppDomain.CurrentDomain.BaseDirectory + "testfiles\\dictionary.txt");
                string line;
                while ( (line = file.ReadLine()) != null )
                {
                    Words.Add(line);
                }*/
                StringSet TestSet = new StringSet(Words.ToArray());
                Console.WriteLine("Number of strings: {0}", TestSet.NumStrings);
                TestSet.Print();
            }
            catch ( Exception E )
            {
                Console.WriteLine("Exception occurred: {0}", E.Message);
            }
            Console.Read(); // just to keep console open
        }
    }
}
 
Ok, with the knowledge that the possible input strings are 'aaa', 'aab', ..., 'zzz', I changed my function to

Code:
        private int _GetBucketNumber ( string s, List<List<string>> Buckets )
        {
            //       s: string whose index to look up
            // Buckets: source buckets

            // disallow empty or NULL strings
            if ( String.IsNullOrEmpty(s) ) { throw new ArgumentException("Cannot add empty or NULL string to set"); }
            if ( Buckets.Count == 0 ) { throw new ArgumentException("Tried to call _GetBucketNumber on empty bucket list"); }

            // XOR characters together and mod by length of buckets
            /*
            char c = s[0];
            for ( int i = 1; i < s.Length; ++i ) { c ^= s[i]; }
            return (int)c % Buckets.Count;*/
            return (676 * (int)('z' - s[0]) + 26 * (int)('z' - s[1]) + 1 * (int)('z' - s[2])) % Buckets.Count;
        }

and of course that worked perfectly. But I want mine to be general enough to work on arbitrary strings.
 
Well, XOR of ASCII will not give a good hash function, no matter how exactly you implement it.
You can generalize your 3-letter approach to arbitrary strings. It is still not a good hash function, but it is better than XOR.

What is the purpose of that hash function? Why don't you use an established hash?
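The generalization suggested above is a positional (polynomial) hash: treat the string as digits of a number in some base, so character order and repeats both matter. A minimal sketch; the base 31 and the function name are illustrative choices, not from the thread:

```csharp
using System;

class PolyHashDemo
{
    // Polynomial hash: h = ((c0 * 31 + c1) * 31 + c2) * 31 + ...
    // Generalizes the 676/26/1 weighting above to any string length.
    static int GetBucketNumber(string s, int bucketCount)
    {
        unchecked
        {
            int h = 0;
            foreach (char c in s) { h = h * 31 + c; }
            // Mask off the sign bit so the modulus is non-negative
            return (h & 0x7FFFFFFF) % bucketCount;
        }
    }

    static void Main()
    {
        // Unlike the XOR fold, appending a repeated pair of characters
        // changes the hash, so "ab" and "abaa" land in different buckets here
        Console.WriteLine(GetBucketNumber("ab", 50));    // prints 5
        Console.WriteLine(GetBucketNumber("abaa", 50));  // prints 9
    }
}
```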
 
How can an algorithm with such predictable and localized bit effects ever be a good hash function? The whole idea of the hash function is to make inputs with very similar bits map to a wide spread of values. Changing one input bit anywhere should change a lot of output bits in an "unpredictable" way.
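The failure of that criterion can be seen directly on the thread's XOR hash: flipping a single input bit flips exactly the same single output bit, which is the opposite of avalanche. A minimal sketch:

```csharp
using System;

class XorAvalancheDemo
{
    // The thread's XOR-fold hash, reduced to its core
    static int XorHash(string s)
    {
        int h = 0;
        foreach (char c in s) { h ^= c; }
        return h;
    }

    static void Main()
    {
        // 'a' (0x61) and 'c' (0x63) differ in one bit; the two hashes
        // differ in exactly that one bit as well
        int diff = XorHash("aaa") ^ XorHash("aac");
        Console.WriteLine(Convert.ToString(diff, 2)); // prints "10"
    }
}
```

A hash with good avalanche behavior would flip roughly half of the output bits for the same one-bit input change.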
 
  • #10
I am missing something.

It seems you are having trouble with underlying concepts, in which case we supplied good links.
Otherwise, if this is for a job, why not use known good algorithms? The FNV hash comes to mind.

Not doing either of these is getting you nowhere. Why not read?
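For reference, a minimal sketch of the 32-bit FNV-1a hash named above; the offset basis and prime are the published FNV constants, and treating each char as one byte is an assumption that holds for ASCII input:

```csharp
using System;

class FnvDemo
{
    // 32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime
    // so the change propagates into the high bits.
    static uint Fnv1a(string s)
    {
        unchecked
        {
            uint h = 2166136261;   // FNV-1a 32-bit offset basis
            foreach (char c in s)
            {
                h ^= c;            // assumes ASCII; wider chars need byte-wise handling
                h *= 16777619;     // FNV 32-bit prime
            }
            return h;
        }
    }

    static void Main()
    {
        // Anagrams like "ab" / "ba" collide under the XOR fold,
        // but get distinct FNV-1a hashes (and so distinct buckets)
        Console.WriteLine(Fnv1a("ab") % 50);
        Console.WriteLine(Fnv1a("ba") % 50);
    }
}
```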
 
  • #11
FactChecker said:
How can an algorithm with such predictable and localized bit effects ever be a good hash function? The whole idea of the hash function is to make inputs with very similar bits map to a wide spread of values. Changing one input bit anywhere should change a lot of output bits in an "unpredictable" way.

Hmmm, I always thought that XOR was considered the best operator for hashing functions, because with AND or OR the successive operations drift toward all-zero or all-one bits.
jim mcnamara said:
Check the concept of avalanche - https://en.wikipedia.org/wiki/Avalanche_effect If your data is going to be from a known, maybe large, set of values, you can use one of the perfect hash tools - feed it all possible data and it constructs a hash with no collisions for that data set, or a subset of it. This has the advantage that you do not need code to detect and handle collisions.

gperf is a tool for that - https://www.gnu.org/software/gperf/manual/gperf.html

Examples of more general hashing algorithms: http://www.eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx

That eternallyconfuzzled article is really good. I implemented the Bernstein hash and it worked out well:

[Image: bucket distribution after switching to the Bernstein hash, spread across all buckets]

Now I'm going to run a performance test and I'll post back results.
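For readers following along, a minimal sketch of the Bernstein (djb2) hash referred to above, in its common `hash * 33 + c` formulation with seed 5381:

```csharp
using System;

class Djb2Demo
{
    // Bernstein (djb2) hash: h = h * 33 + c, seeded with 5381.
    // The repeated multiply spreads each character across many bits.
    static uint Djb2(string s)
    {
        unchecked
        {
            uint h = 5381;
            foreach (char c in s) { h = h * 33 + c; }
            return h;
        }
    }

    static void Main()
    {
        // Anagrams that the XOR fold sends to the same bucket now separate
        Console.WriteLine(Djb2("ab") % 50);  // prints 8
        Console.WriteLine(Djb2("ba") % 50);  // prints 40
    }
}
```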
 
  • #12
Alright, check this out:

[Image: performance test output comparing average lookup ticks for List and the hash set]


That is the result of my performance test

Code:
        // note: Stopwatch requires "using System.Diagnostics;"
        public void RunPerformanceTest ( )
        {
            // tests string hash set against regular List<string> 
            // in terms of lookup

            // create list of all elements in hash set
            List<string> ListVersion = new List<string>(),
                                Copy = new List<string>();
            foreach ( List<string> Bucket in this._Buckets )
            {
                foreach ( string Str in Bucket )
                {
                    ListVersion.Add(Str);
                    Copy.Add(Str);
                }                    
            }

            // calculate average time to look up all elements
            Stopwatch Watch = new Stopwatch();
            long tList = 0, tHset = 0; // ms
            foreach ( string Str in Copy )
            {
                // measure time to look up string in ordinary list
                Watch.Start();
                if ( ListVersion.Contains(Str) ) { }
                Watch.Stop();
                tList += Watch.ElapsedTicks;
                // now measure time to look up same string in my hash set
                Watch.Reset();
                Watch.Start();
                if ( this.Contains(Str) ) { }
                Watch.Stop();
                tHset += Watch.ElapsedTicks;
                Watch.Reset();
            }
            int n = Copy.Count;
            Console.WriteLine("Average ticks to look up in List: {0}", tList / n);
            Console.WriteLine("Average ticks to look up in hashset: {0}", tHset / n);
        }

Thoughts?
 
  • #13
That is very interesting from a learning point of view.

For professional programming, you should use the tools that are already in the language. Many languages (C++, Perl, Python, etc.) already have a hash (a.k.a. associative array, map) implemented. They work well, take care of collisions, allocate memory, etc. C++'s std::unordered_map is a hash implementation. The implementations in the language will probably be optimized in ways that are hard to compete with.
 
  • #14
SlurrerOfSpeech said:
Hmmm, I always thought that XOR was considered the best operator for hashing functions, because with AND or OR the successive operations drift toward all-zero or all-one bits.
XOR is the best approach if you have to treat everything bit by bit. That is an arbitrary constraint that rules out all good hash functions.
FactChecker said:
Many languages (C++, Perl, Python, etc.) already have a hash (a.k.a. associative array, map) implemented.
And in general, you won't beat those implementations until you really know what you are doing.
 
  • #16
I ran the same test with an English dictionary of ~350,000 words. The test took over an hour to run and here are the results:

[Image: benchmark results for the ~350,000-word dictionary test]


That's a 1460x difference in performance. Wow.
 
  • #17
What are you looking up, and how? A binary search in a sorted list should not take more than 19 accesses with 350,000 words, and it avoids any hash collisions, so I don't see where the factor of 1460 comes from.
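The 19-access figure is just ceil(log2(350,000)). In C#, `List<T>.BinarySearch` provides exactly that lookup on a sorted list. A small sketch (`Math.Log2` assumes .NET Core 3.0 or later; the word list is illustrative):

```csharp
using System;
using System.Collections.Generic;

class BinarySearchDemo
{
    static void Main()
    {
        // 350,000 words fit within 2^19, so at most 19 probes
        Console.WriteLine(Math.Ceiling(Math.Log2(350000)));  // prints 19

        // List<T>.BinarySearch gives O(log n) lookup with no hashing,
        // provided the list stays sorted with the same comparer
        var words = new List<string> { "cherry", "apple", "banana" };
        words.Sort(StringComparer.Ordinal);
        Console.WriteLine(words.BinarySearch("banana", StringComparer.Ordinal) >= 0); // prints True
    }
}
```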
 
  • #18
mfb said:
What are you looking up, and how? A binary search in a sorted list should not take more than 19 accesses with 350,000 words, and it avoids any hash collisions, so I don't see where the factor of 1460 comes from.

That is my computed difference in performance of lookup between having the 350k words in an unordered List<string> versus the StringSet class that I created.
 
  • #19
Ah, unordered list. Then it is not surprising.
 
  • #20
A CRC is sometimes used as a hash function, but the number of buckets needs to be a power of 2 to correspond with the CRC.
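One practical payoff of the power-of-two requirement: the modulus reduces to a single bit mask, which assumes the hash's low bits are well mixed (as they are for a CRC). A small sketch of the equivalence:

```csharp
using System;

class MaskDemo
{
    static void Main()
    {
        // For a power-of-two bucket count, hash % size == hash & (size - 1)
        // for any unsigned hash, so no division is needed per lookup.
        uint size = 64;               // must be a power of two
        uint hash = 0xDEADBEEF;       // stand-in for any hash value, e.g. a CRC
        Console.WriteLine(hash % size == (hash & (size - 1))); // prints True
    }
}
```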
 
