Associative Arrays in C/C++ - Favorite Strategies?

  • Context: C/C++ 
  • Thread starter Thread starter lpetrich
  • Start date Start date
  • Tags Tags
    Arrays associative
Click For Summary
SUMMARY

This discussion focuses on implementing associative arrays in C/C++ using strategies similar to those found in languages like Perl and Python. Key techniques include using hash tables for fast lookups, specifically leveraging the C++ Standard Template Library's std::unordered_map introduced in C++11. Participants shared their experiences with hash collision handling, indexing strategies, and performance optimizations, emphasizing the importance of hash code comparisons before element-by-element checks. The conversation also highlighted the potential pitfalls of using outdated GCC versions and the advantages of utilizing Boost libraries for enhanced functionality.

PREREQUISITES
  • Understanding of C++ Standard Template Library (STL)
  • Familiarity with hash table concepts and collision resolution
  • Knowledge of performance optimization techniques in C++
  • Experience with GCC and Boost libraries for C++ development
NEXT STEPS
  • Explore the implementation and usage of std::unordered_map in C++11
  • Research hash table collision resolution techniques and their impact on performance
  • Learn about the Boost libraries and their advantages for C++ associative containers
  • Investigate the differences between various GCC versions and their licensing implications
USEFUL FOR

Software developers, particularly those working with C/C++ who need to implement efficient data structures, as well as anyone interested in optimizing performance through advanced data handling techniques.

lpetrich
Science Advisor
Messages
998
Reaction score
180
Associative Arrays in C/C++ -- Favorite Strategies?

I mean by that something like what Perl or Python or Ruby have, a.k.a. dictionary types. I can do it in Mathematica with the help of label[key] = value and Head and DownValues, but it's C/C++ where it is the most difficult of all.


I once had the task of searching an array with elements {params, data} -- find data for some params value. I used a speedup that consisted of making a hash table of
(hash value for params) -> (index in data array)

I got around the hash-collision problem by checking the element at the data-array index. If it has the right params value, then I've succeeded, but if not, then I'd search for params in the data array and update the hash table with the index that I found.


Most recently, in my Lie-algebra code, I've used the C++ Standard Template Library container "map". I wanted to see if some calculated root vector was already in some list of them, so I created a map object that contains the listed root vectors. Searching in the map object requires doing a compare operation, but I found a way to speed it up relative to an element-by-element compare.

When I add a root vector to the map object, I calculate and include a hash code for it. When it comes time to compare, I first compare the hash codes, then if they are equal, I do an element-by-element comparison. This strategy produces a rather noticeable speedup.


In summary, do a fast lookup or compare, and if that fails or returns equal, then do a slow but accurate one.


Are these reasonable strategies?
 
Technology news on Phys.org


phiby said:
There is a proposal to add hash_map to the STL - however, I am not sure if it's been accepted or not?

Yep, it's in C++11 as the class template std::unordered_map, which was pretty much taken verbatim from the Boost class template of the same name (but of course it's in namespace Boost in Boost rather than in namespace std).
 


If the data size is small enough and ram size is large enough, you can use data as an index to a huge array of addresses to create an associatve array. Initializing the entire array with "invalid address" entries (use a flag or some invalid address value) is time consuming, but access time is quick.

Otherwise you use some type of hash and/or search algorithm as already mentioned.
 
Last edited:


Thanx. Seems like I'm on the right track.

I've been using OSX Snow Leopard's out-of-the-box gcc, which is 4.2. However, I can get gcc up to 4.6 on the OSX package manager fink.

I now recall that in some recent hash-array code, I'd created an index-chain array, where the value for each index is another index whose data has the same hash value. The end of a chain is marked out with NONE (-1). That speeds up searching compared to my earlier effort.

In both of them, I'd used a fixed-size hash array, and a power of 2 at that. I think that if I made the hash table variable-sized, then I'd have to recalculate it every time I changed its size.I also checked on gcc's implementation of STL's associative containers. Those objects use a red-black binary tree.

For N elements, the search time:
Binary tree: log(N)
Linear-search hash table: N/H (H = hash-table size)

If one creates a hash table with a binary tree for each hash value, the search time would become
log(N/H) = log(N) - log(H)
 


lpetrich said:
If one creates a hash table with a binary tree for each hash value, the search time would become
log(N/H) = log(N) - log(H)

You have to be careful with hash-maps because if the hashing algorithm is not uniform then you will get more collisions in certain areas depending on your input data (keys).

The other thing is that along with the above, your map needs to take into account the number of entries because whenever you do get a collision, what will happen then is that it will traverse usually a linked-list structure to get the actual structure pointer which means that now you need to consider that not only are you now searching in linear Big O time, but since it is a linked-list you will have indirection which means that it will not be as fast as a normal array and that the compiler can not optimize the code due to the nature of your bin structure being a linked list.

Even if you used something like a binary-tree for the internal bin structure, you still have indirection and the point of the hash-map is to reduce the search and lookup time of an object to something resembling O(C) and not O(logn): if you want to use something more resembling the latter, then just use a binary tree and forget about the hash-map.
 


lpetrich said:
Thanx. Seems like I'm on the right track.

I've been using OSX Snow Leopard's out-of-the-box gcc, which is 4.2. However, I can get gcc up to 4.6 on the OSX package manager fink.
If you are using 4.2.1, you can use the TR1 (a C++ evaluation committee) version of unordered_map via #include <tr1/unordered_map>.

Word of warning: There's a good reason Snow Leopard comes with a five year old version of gcc. gcc 4.2.1 was the last version of gcc released under GPL version 2. All subsequent releases, including 4.6, were released under GPL version 3. This shouldn't be a problem if you are a student. If you are working for a living, you might want to think hard about using gcc 4.6. IMHO, stay far, far away from gcc 4.6. Release your application as a binary compiled with gcc 4.6 and boom! your work just because public domain.

You can use fink to get the Boost libraries. Boost was released with a reasonable usage license. The Boost are incredibly powerful. Getting Boost is well worth looking into.
 


You can also just use hash map directly with the GCC you already have. This is what I use, it works fine with the gcc in xcode 3.x:

Code:
#include <ext/hash_map>
using namespace ::__gnu_cxx;
namespace __gnu_cxx {                                                                                             
	template<> struct hash< std::string > // Allow STL strings with hash_map
	{ size_t operator()( const std::string& x ) const { return hash< const char* >()( x.c_str() ); } };          
	template<> struct hash< void * > // Allow pointers with hash_map               
	{ size_t operator()( const void * x ) const { return hash< uintptr_t >()( (uintptr_t)x ); } };          
}

(I'm not totally sure, now that I think about it, what this code will do when you run it in clang.)

I think if you try to install your own GCC on Snow Leopard, you will regret it.
 

Similar threads

  • · Replies 25 ·
Replies
25
Views
3K
  • · Replies 2 ·
Replies
2
Views
2K
  • · Replies 5 ·
Replies
5
Views
2K
Replies
20
Views
2K
  • · Replies 133 ·
5
Replies
133
Views
11K
  • · Replies 29 ·
Replies
29
Views
3K
  • · Replies 17 ·
Replies
17
Views
3K
  • · Replies 5 ·
Replies
5
Views
3K
  • · Replies 2 ·
Replies
2
Views
2K
  • · Replies 2 ·
Replies
2
Views
2K