Why isn't my hash set doing a good job?

In summary: my hash set's _GetBucketNumber XORs the characters of each string together and mods by the bucket count, but the strings end up very unevenly distributed across the buckets.
  • #1
SlurrerOfSpeech
So I created my own implementation of a hash set of strings and am XORing the characters as a hashing function:

Code:
        private int _GetBucketNumber ( string s, List<List<string>> Buckets )
        {
            //       s: string whose index to look up
            // Buckets: source buckets

            // disallow empty or NULL strings
            if ( String.IsNullOrEmpty(s) ) { throw new ArgumentException("Cannot add empty or NULL string to set"); }
            if ( Buckets.Count == 0 ) { throw new ArgumentException("Tried to call _GetBucketNumber on empty bucket list"); }

            // XOR characters together and mod by length of buckets
            char c = s[0];
            for ( int i = 1; i < s.Length; ++i ) { c ^= s[i]; }
            return (int)c % Buckets.Count;
        }

Problem is that it's not doing a good job of distributing N strings into N buckets. For instance, I initialize it with 50 buckets, add the first 50 words of http://www.math.sjsu.edu/~foster/dictionary.txt and it ends up like this:

[screenshot: Print() output with all 50 words crowded into a small range of buckets]


Now I know that's not the best example since the first 50 words (alphabetized) are very similar in characters, but I've tested it out with random strings and it does the same thing.
 
  • #2
All the lowercase ASCII characters agree in many of their bits; I guess you cannot fill more than ~30 buckets with that method. All the "a"s don't make it better (note that adding two "a"s to a word doesn't change the hash at all, since they cancel in the XOR).
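A quick sketch (not from the original post) makes the ceiling visible: count the distinct XOR-fold values over all three-letter lowercase strings.

Code:
using System;
using System.Collections.Generic;

class XorRangeDemo
{
    static void Main ( )
    {
        // XOR-fold every three-letter lowercase string and collect the results
        HashSet<int> seen = new HashSet<int>();
        for ( char i = 'a'; i <= 'z'; ++i )
            for ( char j = 'a'; j <= 'z'; ++j )
                for ( char k = 'a'; k <= 'z'; ++k )
                    seen.Add(i ^ j ^ k);

        // lowercase ASCII is 011xxxxx, so the XOR of any three letters
        // lands in 0x60..0x7F: at most 32 distinct values
        Console.WriteLine("Distinct hash values: {0}", seen.Count);
    }
}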
 
  • #3
Check the concept of avalanche: https://en.wikipedia.org/wiki/Avalanche_effect. If your data is going to be from a known, maybe large, set of values, you can use one of the perfect hash tools: feed it all possible data and it constructs a hash with no collisions for that data set (or a subset of it). This has the advantage that you do not need code to detect and handle collisions.

gperf is a tool for that - https://www.gnu.org/software/gperf/manual/gperf.html

Examples of more general hashing algorithms: http://www.eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx
 
  • #4
Why not do a more controlled experiment with three-letter sequences? e.g. aaa, aab, aac, ... xxx, xxa, xxb, xxc

If your algorithm is working, I'd expect aaa to be in one bucket and aab to be in the next... or some such order as that.
 
  • #5
Yep
jedishrfu said:
Why not do a more controlled experiment with three-letter sequences? e.g. aaa, aab, aac, ... xxx, xxa, xxb, xxc

If your algorithm is working, I'd expect aaa to be in one bucket and aab to be in the next... or some such order as that.

They're still getting distributed into the first region of buckets and nowhere else.

Allow me to do a full code dump so that you guys can try to help me find the flaw in my logic.

Code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace StringSet
{
    class StringSet
    {

        private List<List<string>> _Buckets;
        public int NumStrings { get; private set; }

        public StringSet ( )
        {
            this._Buckets = new List<List<string>>();
            this.NumStrings = 0;
        }

        public StringSet ( string[] S )
        {
            // better way to do this?
            this._Buckets = new List<List<string>>();
            foreach ( string s in S ) this._Buckets.Add(new List<string>());
            foreach ( string s in S ) { this.Insert(s);  }
        }

        private int _GetBucketNumber ( string s, List<List<string>> Buckets )
        {
            //       s: string whose index to look up
            // Buckets: source buckets

            // disallow empty or NULL strings
            if ( String.IsNullOrEmpty(s) ) { throw new ArgumentException("Cannot add empty or NULL string to set"); }
            if ( Buckets.Count == 0 ) { throw new ArgumentException("Tried to call _GetBucketNumber on empty bucket list"); }

            // XOR characters together and mod by length of buckets
            char c = s[0];
            for ( int i = 1; i < s.Length; ++i ) { c ^= s[i]; }
            return (int)c % Buckets.Count;
        }

        private void _RehashIfNecessary ( )
        {
            // if the number of strings in the set exceeds the number of buckets,
            // double the number of buckets and redistribute the strings
            if ( this.NumStrings > this._Buckets.Count )
            {
                // the List capacity constructor only reserves space; the new
                // buckets must actually be added before they can be indexed
                int NewCount = this._Buckets.Count * 2;
                List<List<string>> NewBuckets = new List<List<string>>(NewCount);
                for ( int i = 0; i < NewCount; ++i ) { NewBuckets.Add(new List<string>()); }
                foreach ( List<string> Bucket in this._Buckets )
                {
                    foreach ( string s in Bucket )
                    {
                        NewBuckets[this._GetBucketNumber(s, NewBuckets)].Add(s);
                    }
                }
                this._Buckets = NewBuckets;
            }
        }

        public void Insert ( string s )
        {
            // disallow empty or NULL strings
            if ( String.IsNullOrEmpty(s) ) { throw new ArgumentException("Cannot add empty or NULL string to set"); }

            // Get bucket that string belongs in
            List<string> Bucket = this._Buckets[this._GetBucketNumber(s, this._Buckets)];
            // Add if not already there; only count strings actually added
            if ( Bucket.IndexOf(s) == -1 )
            {
                Bucket.Add(s);
                ++NumStrings;
                _RehashIfNecessary();
            }
        }
      
        public bool Contains ( string s )
        {
            // returns true or false depending on whether s is a
            // string currently in the set

            return (this._Buckets[this._GetBucketNumber(s,this._Buckets)].IndexOf(s) != -1);
        }
      
        public void Print ( )
        {
            for ( int i = 0; i < this._Buckets.Count; ++i )
            {
                Console.WriteLine("Bucket {0}: {1}", i, string.Join(",",this._Buckets[i].ToArray()));
            }
        }

    }

    class Program
    {
        static void Main ( string[] args )
        {
          
            try
            {
                List<string> Words = new List<string>();
                for (char i = 'a'; i <= 'z'; ++i)
                    for (char j = 'a'; j <= 'z'; ++j)
                        for (char k = 'a'; k <= 'z'; ++k )
                            Words.Add(new string(new char[] { i, j, k }));
                /*
                System.IO.StreamReader file = new System.IO.StreamReader(System.AppDomain.CurrentDomain.BaseDirectory + "testfiles\\dictionary.txt");
                string line;
                while ( (line = file.ReadLine()) != null )
                {
                    Words.Add(line);
                }*/
                StringSet TestSet = new StringSet(Words.ToArray());
                Console.WriteLine("Number of strings: {0}", TestSet.NumStrings);
                TestSet.Print();
            }
            catch ( Exception E )
            {
                Console.WriteLine("Exception occured: {0}", E.Message);
            }
            Console.Read(); // just to keep console open
        }
    }
}
 
  • #6
OK, with the knowledge that the possible input strings range over 'aaa', 'aab', ..., 'zzz', I changed my function to

Code:
        private int _GetBucketNumber ( string s, List<List<string>> Buckets )
        {
            //       s: string whose index to look up
            // Buckets: source buckets

            // disallow empty or NULL strings
            if ( String.IsNullOrEmpty(s) ) { throw new ArgumentException("Cannot add empty or NULL string to set"); }
            if ( Buckets.Count == 0 ) { throw new ArgumentException("Tried to call _GetBucketNumber on empty bucket list"); }

            // treat the three letters as digits of a base-26 number
            // (previous XOR version kept below for comparison)
            /*
            char c = s[0];
            for ( int i = 1; i < s.Length; ++i ) { c ^= s[i]; }
            return (int)c % Buckets.Count;*/
            return (676 * (int)('z' - s[0]) + 26 * (int)('z' - s[1]) + 1 * (int)('z' - s[2])) % Buckets.Count;
        }

and of course that worked perfectly. But I want mine to be general enough to work on arbitrary strings.
 
  • #7
Well, XOR of ASCII will not give a good hash function, no matter how exactly you implement it.
You can generalize your 3-letter approach to arbitrary strings. It is still not a good hash function, but it is better than XOR.
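For illustration, a minimal sketch of that generalization (a polynomial hash; the multiplier 31 and the sign-bit mask are conventional choices, not something from the thread):

Code:
        // sketch: polynomial hash over arbitrary strings
        private int _GetBucketNumber ( string s, List<List<string>> Buckets )
        {
            if ( String.IsNullOrEmpty(s) ) { throw new ArgumentException("Cannot add empty or NULL string to set"); }
            if ( Buckets.Count == 0 ) { throw new ArgumentException("Tried to call _GetBucketNumber on empty bucket list"); }

            int h = 0;
            foreach ( char c in s ) { h = unchecked(h * 31 + c); } // let overflow wrap
            return (h & 0x7FFFFFFF) % Buckets.Count; // clear the sign bit before the mod
        }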

What is the purpose of that hash function? Why don't you use an established hash?
 
  • #8
How can an algorithm with such predictable and localized bit effects ever be a good hash function? The whole idea of the hash function is to make inputs with very similar bits map to a wide spread of values. Changing one input bit anywhere should change a lot of output bits in an "unpredictable" way.
 
  • #10
I am missing something.

It seems you are having trouble with underlying concepts, in which case we have supplied good links.
Otherwise, if this is for a job, why not use a known good algorithm? The FNV hash comes to mind.
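For reference, a minimal sketch of 32-bit FNV-1a (the offset basis and prime are the published FNV constants; the C# transcription and method name are mine):

Code:
        // sketch: 32-bit FNV-1a over a string's UTF-16 code units
        private static uint _Fnv1a ( string s )
        {
            uint h = 2166136261; // FNV offset basis
            foreach ( char c in s )
            {
                h ^= c;          // XOR in the next character...
                h *= 16777619;   // ...then multiply by the FNV prime
            }
            return h;
        }
        // bucket index: (int)(_Fnv1a(s) % (uint)Buckets.Count)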

Not doing either of these is getting you nowhere. Why not read?
 
  • #11
FactChecker said:
How can an algorithm with such predictable and localized bit effects ever be a good hash function? The whole idea of the hash function is to make inputs with very similar bits map to a wide spread of values. Changing one input bit anywhere should change a lot of output bits in an "unpredictable" way.

Hmmm, I always thought that XOR was considered the best operator for hashing functions, because with AND or OR the successive operations would drive the bits toward all 1s or all 0s.
jim mcnamara said:
Check the concept of avalanche: https://en.wikipedia.org/wiki/Avalanche_effect. If your data is going to be from a known, maybe large, set of values, you can use one of the perfect hash tools: feed it all possible data and it constructs a hash with no collisions for that data set (or a subset of it). This has the advantage that you do not need code to detect and handle collisions.

gperf is a tool for that - https://www.gnu.org/software/gperf/manual/gperf.html

Examples of more general hashing algorithms: http://www.eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx

That eternallyconfuzzled article is really good. I implemented the Bernstein hash and it worked out well:

[screenshot: Print() output with the words spread fairly evenly across the buckets]
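For reference, the Bernstein hash from that article is just h = h * 33 + c starting from 5381; a minimal sketch (my transcription, with my own method name):

Code:
        // sketch: Bernstein (djb2) string hash
        private static uint _Djb2 ( string s )
        {
            uint h = 5381;
            foreach ( char c in s ) { h = h * 33 + c; } // equivalently ((h << 5) + h) + c
            return h;
        }
        // bucket index: (int)(_Djb2(s) % (uint)Buckets.Count)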
I'm going to run a performance test and I'll post back results.
 
  • #12
Alright, check this out:

[screenshot: console output of the performance test]

That is the result of my performance test:

Code:
        public void RunPerformanceTest ( )
        {
            // tests string hash set against regular List<string> 
            // in terms of lookup

            // create list of all elements in hash set
            List<string> ListVersion = new List<string>(),
                                Copy = new List<string>();
            foreach ( List<string> Bucket in this._Buckets )
            {
                foreach ( string Str in Bucket )
                {
                    ListVersion.Add(Str);
                    Copy.Add(Str);
                }                    
            }

            // calculate average time to look up all elements
            Stopwatch Watch = new Stopwatch(); // requires "using System.Diagnostics;" at the top of the file
            long tList = 0, tHset = 0; // elapsed ticks
            foreach ( string Str in Copy )
            {
                // measure time to look up string in ordinary list
                Watch.Start();
                if ( ListVersion.Contains(Str) ) { }
                Watch.Stop();
                tList += Watch.ElapsedTicks;
                // now measure time to look up same string in my hash set
                Watch.Reset();
                Watch.Start();
                if ( this.Contains(Str) ) { }
                Watch.Stop();
                tHset += Watch.ElapsedTicks;
                Watch.Reset();
            }
            int n = Copy.Count;
            Console.WriteLine("Average ticks to look up in List: {0}", tList / n);
            Console.WriteLine("Average ticks to look up in hashset: {0}", tHset / n);
        }

Thoughts?
 
  • #13
That is very interesting from a learning point of view.

For professional programming, you should use the tools that are already in the language. Many languages (C++, Perl, Python, etc.) already have a hash (a.k.a. associative array, map) implemented. They work well, take care of collisions, allocate memory, etc. C++'s std::unordered_map, for example, is a hash implementation. The implementations in the language will probably be optimized in ways that are hard to compete with.
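In C#, the built-in equivalent is HashSet&lt;string&gt; in System.Collections.Generic; a minimal usage sketch (the example values are mine):

Code:
using System;
using System.Collections.Generic;

class BuiltInDemo
{
    static void Main ( )
    {
        // the framework's HashSet<T> handles hashing, collisions, and resizing
        HashSet<string> set = new HashSet<string> { "aaa", "aab", "aac" };
        Console.WriteLine(set.Contains("aab")); // True
        Console.WriteLine(set.Add("aab"));      // False: duplicates are ignored
    }
}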
 
  • #14
SlurrerOfSpeech said:
Hmmm, I always thought that XOR was considered the best operator for hashing functions, because with AND or OR the successive operations would drive the bits toward all 1s or all 0s.
XOR is the best approach if you have to treat everything bit by bit. That is an arbitrary constraint that rules out all good hash functions.
FactChecker said:
Many languages (C++, Perl, Python, etc.) already have a hash (a.k.a. associative array, map) implemented.
And in general, you won't beat those implementations until you really know what you are doing.
 
  • #16
I ran the same test with an English dictionary of ~350,000 words. The test took over an hour to run and here are the results:

[screenshot: average lookup ticks for the List and for the hash set over the ~350,000-word dictionary]


That's a 1460x difference in performance. Wow.
 
  • #17
What are you looking up, and how? A binary search in a sorted list should not take more than 19 accesses with 350,000 words (log2 of 350,000 is about 18.4), and it avoids any hash collisions, so I don't see where the factor of 1460 comes from.
 
  • #18
mfb said:
What are you looking up, and how? A binary search in a sorted list should not take more than 19 accesses with 350,000 words (log2 of 350,000 is about 18.4), and it avoids any hash collisions, so I don't see where the factor of 1460 comes from.

That is the computed difference in lookup performance between keeping the 350k words in an unordered List<string> and keeping them in the StringSet class that I created.
 
  • #19
Ah, unordered list. Then it is not surprising.
 
  • #20
A CRC is sometimes used as a hash function, but the number of buckets needs to be a power of 2 to correspond with the CRC.
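For illustration, a minimal sketch (mine, not from the thread) of a bitwise CRC-32 used as the hash; with 2^k buckets, masking the low bits replaces the mod:

Code:
        // sketch: bitwise CRC-32 (reflected, polynomial 0xEDB88320)
        private static uint _Crc32 ( string s )
        {
            uint crc = 0xFFFFFFFF;
            foreach ( char ch in s )
            {
                crc ^= (byte)ch; // low byte of each char, for simplicity
                for ( int bit = 0; bit < 8; ++bit )
                    crc = ( (crc & 1) != 0 ) ? (crc >> 1) ^ 0xEDB88320 : crc >> 1;
            }
            return ~crc;
        }
        // with a power-of-two bucket count, mask instead of mod:
        // int index = (int)(_Crc32(s) & (uint)(Buckets.Count - 1));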
 

