Associative Arrays in C/C++ - Favorite Strategies?

lpetrich · Mar 22, 2012

Associative Arrays in C/C++ -- Favorite Strategies?

I mean by that something like what Perl or Python or Ruby have, a.k.a. dictionary types. I can do it in Mathematica with the help of label[key] = value and Head and DownValues, but it's C/C++ where it is the most difficult of all.

I once had the task of searching an array with elements {params, data} -- find data for some params value. I used a speedup that consisted of making a hash table of
(hash value for params) -> (index in data array)

I got around the hash-collision problem by checking the element at the data-array index. If it has the right params value, then I've succeeded, but if not, then I'd search for params in the data array and update the hash table with the index that I found.

Most recently, in my Lie-algebra code, I've used the C++ Standard Template Library container "map". I wanted to see if some calculated root vector was already in some list of them, so I created a map object that contains the listed root vectors. Searching in the map object requires doing a compare operation, but I found a way to speed it up relative to an element-by-element compare.

When I add a root vector to the map object, I calculate and include a hash code for it. When it comes time to compare, I first compare the hash codes, then if they are equal, I do an element-by-element comparison. This strategy produces a rather noticeable speedup.

In summary, do a fast lookup or compare, and if that fails or returns equal, then do a slow but accurate one.

Are these reasonable strategies?

phiby · Mar 22, 2012

There is a proposal to add hash_map to the STL - however, I am not sure if it's been accepted or not?

However, a lot of STL implementations have a hash_map
http://msdn.microsoft.com/en-us/library/0d462wfh(v=vs.80).aspx
http://www.sgi.com/tech/stl/hash_map.html

D H · Mar 22, 2012

phiby said:

There is a proposal to add hash_map to the STL - however, I am not sure if it's been accepted or not?

Yep, it's in C++11 as the class template std::unordered_map, which was pretty much taken verbatim from the Boost class template of the same name (but of course it's in namespace Boost in Boost rather than in namespace std).

rcgldr · Mar 22, 2012

If the data size is small enough and ram size is large enough, you can use data as an index to a huge array of addresses to create an associatve array. Initializing the entire array with "invalid address" entries (use a flag or some invalid address value) is time consuming, but access time is quick.

Otherwise you use some type of hash and/or search algorithm as already mentioned.

lpetrich · Mar 23, 2012

Thanx. Seems like I'm on the right track.

I've been using OSX Snow Leopard's out-of-the-box gcc, which is 4.2. However, I can get gcc up to 4.6 on the OSX package manager fink.

I now recall that in some recent hash-array code, I'd created an index-chain array, where the value for each index is another index whose data has the same hash value. The end of a chain is marked out with NONE (-1). That speeds up searching compared to my earlier effort.

In both of them, I'd used a fixed-size hash array, and a power of 2 at that. I think that if I made the hash table variable-sized, then I'd have to recalculate it every time I changed its size.I also checked on gcc's implementation of STL's associative containers. Those objects use a red-black binary tree.

For N elements, the search time:
Binary tree: log(N)
Linear-search hash table: N/H (H = hash-table size)

If one creates a hash table with a binary tree for each hash value, the search time would become
log(N/H) = log(N) - log(H)

chiro · Mar 23, 2012

lpetrich said:

If one creates a hash table with a binary tree for each hash value, the search time would become
log(N/H) = log(N) - log(H)

You have to be careful with hash-maps because if the hashing algorithm is not uniform then you will get more collisions in certain areas depending on your input data (keys).

The other thing is that along with the above, your map needs to take into account the number of entries because whenever you do get a collision, what will happen then is that it will traverse usually a linked-list structure to get the actual structure pointer which means that now you need to consider that not only are you now searching in linear Big O time, but since it is a linked-list you will have indirection which means that it will not be as fast as a normal array and that the compiler can not optimize the code due to the nature of your bin structure being a linked list.

Even if you used something like a binary-tree for the internal bin structure, you still have indirection and the point of the hash-map is to reduce the search and lookup time of an object to something resembling O(C) and not O(logn): if you want to use something more resembling the latter, then just use a binary tree and forget about the hash-map.

D H · Mar 23, 2012

lpetrich said:

Thanx. Seems like I'm on the right track.

I've been using OSX Snow Leopard's out-of-the-box gcc, which is 4.2. However, I can get gcc up to 4.6 on the OSX package manager fink.

If you are using 4.2.1, you can use the TR1 (a C++ evaluation committee) version of unordered_map via #include <tr1/unordered_map>.

Word of warning: There's a good reason Snow Leopard comes with a five year old version of gcc. gcc 4.2.1 was the last version of gcc released under GPL version 2. All subsequent releases, including 4.6, were released under GPL version 3. This shouldn't be a problem if you are a student. If you are working for a living, you might want to think hard about using gcc 4.6. IMHO, stay far, far away from gcc 4.6. Release your application as a binary compiled with gcc 4.6 and boom! your work just because public domain.

You can use fink to get the Boost libraries. Boost was released with a reasonable usage license. The Boost are incredibly powerful. Getting Boost is well worth looking into.

Coin · Mar 25, 2012

You can also just use hash map directly with the GCC you already have. This is what I use, it works fine with the gcc in xcode 3.x:

Code:

#include <ext/hash_map>
using namespace ::__gnu_cxx;
namespace __gnu_cxx {                                                                                             
	template<> struct hash< std::string > // Allow STL strings with hash_map
	{ size_t operator()( const std::string& x ) const { return hash< const char* >()( x.c_str() ); } };          
	template<> struct hash< void * > // Allow pointers with hash_map               
	{ size_t operator()( const void * x ) const { return hash< uintptr_t >()( (uintptr_t)x ); } };          
}

(I'm not totally sure, now that I think about it, what this code will do when you run it in clang.)

I think if you try to install your own GCC on Snow Leopard, you will regret it.

Associative Arrays in C/C++ - Favorite Strategies?

Similar threads

How to increase phone signal strength by lying about it

Who is responsible for the software when AI takes over programming?

A Crisis for Newly Minted CompSci Majors -- entry level jobs gone

Learning Assembly and computer architecture for x86

Learning data structures and algorithms in different programming languages

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect

Insights What Exactly is Dirac’s Delta Function? - Insight

Insights Relativator (Circular Slide-Rule): Simulated with Desmos - Insight

Insights Fixing Things Which Can Go Wrong With Complex Numbers