Efficient Set Membership Testing with Hashtable?

  • Thread starter CRGreathouse
In summary, a Bloom filter is a probabilistic data structure for testing set membership. It hashes each element into a shared bit array with several hash functions; lookups can return false positives at a tunable rate, but never false negatives.
  • #1
CRGreathouse
Science Advisor
Homework Helper
I'm looking for a fast data structure for determining set membership. Suppose I have 10,000 numbers ranging from 1 to 10 billion or so, and I want a quick test to see whether a given number n is 'probably' one of those 10,000 values. A hash table seems close to what I want, but since I have no value to associate with these numbers it doesn't seem quite right -- and I'm pretty sure I've seen a data structure designed for just this problem.

In particular, I'd like the structure, once initialized (it won't change thereafter), to return "true" if a given member is in the set with high probability, and "false" otherwise. 99.9% would quite suffice, and even lower probabilities may be acceptable.
 
  • #2
Wait, maybe I just need to sort the list and use binary search. Thoughts?
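If it comes to that, the whole approach is only a few lines with the standard library -- a minimal sketch, assuming 64-bit elements (the values here run past 32 bits):

Code:
#include <algorithm>
#include <cstdint>
#include <vector>

// Sort once up front, then answer each membership query in O(log n).
bool contains(const std::vector<std::uint64_t>& sorted_table, std::uint64_t n) {
    return std::binary_search(sorted_table.begin(), sorted_table.end(), n);
}

int main() {
    std::vector<std::uint64_t> table = {9999999999ULL, 42, 7};
    std::sort(table.begin(), table.end());      // one-time O(n log n) setup
    return contains(table, 42) ? 0 : 1;
}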
 
  • #3
If you have a single permanent list at the start and don't have to add new numbers to it, that would seem to make sense.
Depending on how much memory you have and the distribution of the numbers, you might do better with a hash as a prefilter.

You could hash based on the sum of the digits, which would give a table saying whether any numbers in the list have that sum. Or you could have an array of, say, 16K entries corresponding to the most or least significant word of the number, where each entry stores just a bool indicating whether any numbers starting with that word are in the list.

You would then search the main list to see whether there was an exact match. Knuth has whole chapters on how to do this efficiently!
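A rough sketch of that second idea -- a one-bit-per-prefix table keyed on the high bits (the 14-bit table size and the 64-bit element type are illustrative assumptions):

Code:
#include <bitset>
#include <cstdint>
#include <vector>

// One bit per possible high "word": a miss proves the number is absent,
// a hit means the main list might contain a number with that prefix.
constexpr int kPrefixBits = 14;                    // 16K entries
std::bitset<1 << kPrefixBits> prefix_seen;

std::size_t prefix_of(std::uint64_t n) {
    // Values up to ~10 billion fit in 34 bits; keep the top 14 of those.
    return static_cast<std::size_t>(n >> (34 - kPrefixBits));
}

void build_prefilter(const std::vector<std::uint64_t>& table) {
    for (auto n : table) prefix_seen.set(prefix_of(n));
}

bool might_contain(std::uint64_t n) {
    return prefix_seen.test(prefix_of(n));         // cheap reject before the real search
}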
 
  • #4
What language are you using? C++ for example provides set data structures. Have you tried whatever your language provides? How did it perform?
 
  • #5
nmtim said:
What language are you using? C++ for example provides set data structures. Have you tried whatever your language provides? How did it perform?

At the moment I'm using C++, though without any optimizations -- I'm just trying to find good algorithms for this problem. The latest version of my program (still very much in development), in broad strokes:

  • Populate the list. At the moment the list is about 40 MB, with several million members.
  • Sort the list -- since the elements beyond the first few are pseudorandom with fair dispersion, I'm using the last element as a pivot. (If this were a bottleneck I might take the median of the last three elements instead.) A standard-library version of this step and the next is sketched below.
  • Discard duplicates, freeing space at the end of the array.
  • Run the main section (four nested loops):
    • Check if a given branch can be ruled out by congruence conditions. If not,
    • Use a binary search on the large, pregenerated list to see if the member is present.

So at the moment I'm using no special library features -- just standard arrays and my own functions to manipulate data.
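For steps 2 and 3 above, the equivalent with the standard library would be something like this sketch (the 64-bit element type is assumed):

Code:
#include <algorithm>
#include <cstdint>
#include <vector>

void sort_and_dedup(std::vector<std::uint64_t>& table) {
    // Step 2: sort (std::sort's pivot handling is internal to the library).
    std::sort(table.begin(), table.end());
    // Step 3: discard duplicates, freeing space at the end of the array.
    table.erase(std::unique(table.begin(), table.end()), table.end());
    table.shrink_to_fit();                        // optionally release the slack
}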
 
  • #6
mgb_phys said:
If you have a single permanent list at the start and don't have to add new numbers to it, that would seem to make sense.
Depending on how much memory you have and the distribution of the numbers, you might do better with a hash as a prefilter.

I have plenty of memory, and processing power is currently more of a bottleneck than memory. The table is far too large to fit into any cache, so memory access is presumably pretty bad but I can't do much about that.

The numbers in the table/hash are generated modulo a large constant, so they're pseudorandom and essentially uniformly distributed.

mgb_phys said:
You could hash based on the sum of the digits, which would give a table saying whether any numbers in the list have that sum. Or you could have an array of, say, 16K entries corresponding to the most or least significant word of the number, where each entry stores just a bool indicating whether any numbers starting with that word are in the list.

Hmm, that sounds like a good idea. Just right shift and lookup, that's it?
 
  • #7
Yes, you don't need to be too picky about the hash function -- you just want something that gives reasonable discrimination between present and not present, and that produces a reasonable number of outputs: too many and you'll spend more memory on the hash table than on the data; too few and you'll always get hits.

Basically you're looking for some parameter in which the numbers aren't randomly distributed, so that you get good discrimination.
The high-word approach would be good if the numbers are clustered.
 
  • #8
Ah, I found it! The structure I was looking for, the one I'd read about before, is the Bloom filter. Has anyone used one of these, or something similar? It's a probabilistic set membership test -- if an item fails the test it's certainly not a member, which is exactly the property I want. I can verify positives in a number of different ways, but as it is I haven't found any yet, so for the purpose of optimization they don't exist.

mgb_phys said:
Basically you're looking for some parameter in which the numbers aren't randomly distributed, so that you get good discrimination.
The high-word approach would be good if the numbers are clustered.

Obviously I'll have to test it, but I think it should work. I do know that the least significant bits aren't good -- the numbers all have the same parity and there are some other trends in the next two bits.
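For concreteness, a minimal Bloom filter sketch -- the table size, the two multiplicative hash functions, and k = 2 are illustrative assumptions, not tuned choices:

Code:
#include <bitset>
#include <cstdint>

// k = 2 hash functions into one shared bit array. A miss on either bit
// proves non-membership; two hits mean "probably a member".
constexpr std::size_t kBits = 1u << 20;            // ~1M bits, illustrative
std::bitset<kBits> bloom;

// Shifting by 44 keeps the top 20 bits, matching the 2^20-bit table.
std::size_t h1(std::uint64_t n) { return (n * 0x9E3779B97F4A7C15ULL) >> 44; }
std::size_t h2(std::uint64_t n) { return (n * 0xC2B2AE3D27D4EB4FULL) >> 44; }

void bloom_add(std::uint64_t n) { bloom.set(h1(n)); bloom.set(h2(n)); }

bool bloom_query(std::uint64_t n) {
    return bloom.test(h1(n)) && bloom.test(h2(n));
}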
 
  • #9
You should actually check whether a Bloom filter really is an improvement over just storing a boolean in each bucket of an ordinary hash table (equivalently, over using k=1 in your Bloom filter). The Bloom filter's strength is space optimization, at the cost of extra memory accesses and hash calculations. The only way it leads to a time improvement is if it allows a better memory access pattern (e.g. if you can fit the hash table into your L1 cache).

For comparison, note that just storing a 2-gigabyte array of booleans requires just one memory access and no hash calculations per lookup. You can avoid the bit twiddling if you unpack it to a 10- or even 80-gigabyte array. The reason this is not a time advantage is that you never get any cache hits, and it requires disk accesses if you don't have enough real RAM.
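A sketch of the packed version -- one bit per possible value, with the range bound taken from earlier in the thread and everything else illustrative:

Code:
#include <cstdint>
#include <vector>

// One bit per value in [0, 10 billion): ~1.25 GB when packed into words.
struct BitArray {
    std::vector<std::uint64_t> words;
    explicit BitArray(std::uint64_t range) : words((range + 63) / 64) {}
    void set(std::uint64_t n)        { words[n >> 6] |= 1ULL << (n & 63); }
    bool test(std::uint64_t n) const { return (words[n >> 6] >> (n & 63)) & 1; }
};

// Usage: BitArray present(10'000'000'000ULL); present.set(x);
// each lookup is a single memory access plus a shift and a mask.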
 
  • #10
Update -- as Hurkyl mentioned, just storing an array is the quickest. You stated that the range of the numbers was 0 to 10 billion. Assuming these numbers are exact integers, and you only need to know whether a number is in the list, you just need 10 billion bits, or 1.25 GB of RAM, to do the job; 2 GB of RAM total should be enough. (Using Task Manager I see 1.57 GB of free physical RAM on my system with 2 GB, so if the rest of the stuff can fit in 250 MB, it's enough.)

If you needed a sort:
For the sorting, a merge sort is the quickest (fewer operations, lots of cache hits), and it's very easy to skip (delete) duplicates within a merge sort. In your case the elements are small, so just sort the elements themselves; if they were larger, it would be better to sort pointers and do a single copy/move at the end. The overhead of a merge sort is that it uses twice as much memory (though only twice as much pointer memory for larger elements).
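A sketch of that merge step with the duplicate-skipping folded in (the names and the 64-bit element type are mine, for illustration):

Code:
#include <cstdint>
#include <vector>

// Merge two sorted runs into `out`, dropping duplicates as we go.
void merge_dedup(const std::vector<std::uint64_t>& a,
                 const std::vector<std::uint64_t>& b,
                 std::vector<std::uint64_t>& out) {
    std::size_t i = 0, j = 0;
    while (i < a.size() || j < b.size()) {
        std::uint64_t next;
        if (j == b.size() || (i < a.size() && a[i] <= b[j]))
            next = a[i++];
        else
            next = b[j++];
        if (out.empty() || out.back() != next)    // skip duplicates
            out.push_back(next);
    }
}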
 
  • #11
Have you considered a sparse array?
 
  • #12
Jeff Reid said:
Update -- as Hurkyl mentioned, just storing an array is the quickest. You stated that the range of the numbers was 0 to 10 billion. Assuming these numbers are exact integers, and you only need to know whether a number is in the list, you just need 10 billion bits, or 1.25 GB of RAM, to do the job; 2 GB of RAM total should be enough. (Using Task Manager I see 1.57 GB of free physical RAM on my system with 2 GB, so if the rest of the stuff can fit in 250 MB, it's enough.)

Yes, I'm finding that arrays are working pretty well. I'm trying to convert to a more efficient form right now.
 
  • #13
Hurkyl said:
You should actually check whether a Bloom filter really is an improvement over just storing a boolean in each bucket of an ordinary hash table (equivalently, over using k=1 in your Bloom filter). The Bloom filter's strength is space optimization, at the cost of extra memory accesses and hash calculations. The only way it leads to a time improvement is if it allows a better memory access pattern (e.g. if you can fit the hash table into your L1 cache).

I don't know yet which way the problem will go, so if the space requirements get large I'll have to build some other structure, possibly a Bloom filter if I can write an efficient implementation. Depending on how I do it, a pair of bit arrays using different moduli may work better. Edit: but that's not too different from a Bloom filter in the first place...
 
  • #14
jim mcnamara said:
Have you considered a sparse array?

Well, that's an interesting question. I can use a bit array as long as the modulus and waste bits, or I can make an array of 32-bit ints as long as my z-limit and waste processor power. At the moment I'm comparing the two methods, but so far the less sophisticated bit array is faster, and my program crashes (!) before I hit any kind of memory limit.
 
  • #15
CRGreathouse said:
I don't know yet which way the problem will go, so if the space requirements get large I'll have to build some other structure, possibly a Bloom filter if I can write an efficient implementation. Depending on how I do it, a pair of bit arrays using different moduli may work better. Edit: but that's not too different from a Bloom filter in the first place...
Right -- the two implementations are essentially identical. The Bloom filter would just make both lookups in the same array, rather than in two different arrays... and while the Bloom filter should have a better false positive rate, the array is so sparse that the effect should be negligible.

Incidentally, if you're serious about speed, you should avoid doing any sort of division: it's a slow operation. If your modulus is a power of 2, you can simply do a bitmask to compute the remainder... but you only get one hash function that way. To build more, you should try to make them out of bit operations, or maybe invoke a multiplication if necessary.
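Something along these lines, say -- the table size is arbitrary and the constant is the usual golden-ratio multiplier, so treat it as a sketch:

Code:
#include <cstdint>

constexpr unsigned kTableBits = 20;                 // table of 2^20 entries, illustrative

// Power-of-2 modulus: the remainder is a single AND, no division.
std::size_t hash_mask(std::uint64_t n) {
    return static_cast<std::size_t>(n & ((1u << kTableBits) - 1));
}

// A second hash from one multiplication and a shift (multiplicative
// hashing; the constant is 2^64 divided by the golden ratio).
std::size_t hash_mult(std::uint64_t n) {
    return static_cast<std::size_t>((n * 0x9E3779B97F4A7C15ULL) >> (64 - kTableBits));
}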
 
  • #16
Jeff Reid said:
If you needed a sort:
For the sorting, a merge sort is the quickest (fewer operations, lots of cache hits), and it's very easy to skip (delete) duplicates within a merge sort. In your case the elements are small, so just sort the elements themselves.
Better than quicksort?

Anyway, I'd imagine a radix sort ought to beat any comparison-based sort.

But really, if you only have a few thousand integers to sort, and only need to do it once, then even a bubble sort should be good enough!
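For reference, a compact LSD radix sort over 64-bit keys might look like this (byte-at-a-time passes; purely illustrative):

Code:
#include <array>
#include <cstdint>
#include <vector>

// LSD radix sort: eight stable counting-sort passes, one byte per pass.
void radix_sort(std::vector<std::uint64_t>& v) {
    std::vector<std::uint64_t> buf(v.size());
    for (int shift = 0; shift < 64; shift += 8) {
        std::array<std::size_t, 257> count{};      // histogram for this byte
        for (auto x : v) ++count[((x >> shift) & 0xFF) + 1];
        for (int i = 0; i < 256; ++i) count[i + 1] += count[i];
        for (auto x : v) buf[count[(x >> shift) & 0xFF]++] = x;
        v.swap(buf);                               // even number of passes: result ends in v
    }
}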
 
  • #17
Hurkyl said:
Incidentally, if you're serious about speed, you should avoid doing any sort of division: it's a slow operation. If your modulus is a power of 2, you can simply do a bitmask to compute the remainder... but you only get one hash function that way. To build more, you should try to make them out of bit operations, or maybe invoke a multiplication if necessary.

In more than three-quarters of the cases I consider, I use a special case that needs only boolean operations. The hard part that dominates runtime is the remainder -- the part I'm discussing here. So in effect I'm already using that filter, in a highly optimized fashion, and this is just looking for a good way to do the rest. I might now use a shift-and-mask as a crude filter, which works only because I believe the 'middle bits' to be well dispersed -- something like (n >> 3) & 0xFFFFFF.
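That crude filter pairs naturally with a bit table indexed by the masked value -- a sketch, with the table size following from the 24-bit mask and the rest assumed:

Code:
#include <bitset>
#include <cstdint>

// 24-bit index -> 16M bits, a 2 MB table.
std::bitset<1 << 24> middle_bits_seen;

std::size_t crude_hash(std::uint64_t n) { return (n >> 3) & 0xFFFFFF; }

void mark_present(std::uint64_t n)     { middle_bits_seen.set(crude_hash(n)); }
bool might_be_present(std::uint64_t n) { return middle_bits_seen.test(crude_hash(n)); }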
 

1. What is a hash table for booleans?

A hash table whose keys are the set elements (here, the numbers) and whose values are booleans marking membership. Since the value carries no information beyond "present", it is effectively a hash set: the structure only needs to record which keys exist.

2. How is a hash table for booleans different from a regular hash table?

A regular hash table associates an arbitrary value with each key. When the value is just a membership flag, the value storage can be dropped entirely or reduced to a single bit per bucket, which is what makes this layout attractive for a pure membership test.

3. What are the advantages of using a hash table for booleans?

Lookup takes expected constant time, independent of how many elements are stored, and keys are unique by construction, so duplicates are handled automatically.

4. Can a hash table for booleans handle null values?

That depends on the implementation. Java's java.util.Hashtable, for example, rejects null keys and null values, while C++ containers such as std::unordered_set have no notion of null and store whatever key values you insert.

5. How do you add or remove elements from a hash table for booleans?

Through the container's insertion and removal operations -- for example put() and remove() on Java's Hashtable, or insert() and erase() on C++'s std::unordered_set.
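In C++, the language used in the thread above, the same operations with std::unordered_set look like this sketch:

Code:
#include <cstdint>
#include <unordered_set>

int main() {
    std::unordered_set<std::uint64_t> members = {42, 7, 9999999999ULL};

    members.insert(13);                        // add an element
    members.erase(7);                          // remove an element

    bool present = members.count(42) != 0;     // expected O(1) membership test
    return present ? 0 : 1;
}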
