Associative Arrays in C/C++ - Favorite Strategies?

  • Context: C/C++ 
  • Thread starter Thread starter lpetrich
  • Start date Start date
  • Tags Tags
    Arrays associative
Click For Summary

Discussion Overview

The discussion revolves around strategies for implementing associative arrays in C/C++, akin to dictionary types found in languages like Perl, Python, or Ruby. Participants explore various methods, including hash tables and the use of the C++ Standard Template Library (STL) containers, while addressing performance considerations and potential pitfalls.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant describes using a hash table to map hash values of parameters to indices in a data array, addressing hash collisions by verifying values at the indices.
  • Another participant mentions the proposal of adding hash_map to the STL, noting that it was included in C++11 as std::unordered_map, based on a Boost template.
  • A suggestion is made to use a large array of addresses for small data sizes, initializing it with invalid entries to facilitate quick access.
  • A participant discusses creating an index-chain array for hash values, which improves search speed compared to previous methods.
  • Concerns are raised about the efficiency of hash maps, particularly regarding non-uniform hashing leading to increased collisions and the impact of linked-list traversal on performance.
  • Another participant warns about potential licensing issues with different versions of gcc, suggesting caution when upgrading from older versions.
  • A code snippet is provided for using hash_map with GCC, including custom hash functions for STL strings and pointers.

Areas of Agreement / Disagreement

Participants express a range of strategies and concerns regarding associative arrays, with no clear consensus on the best approach or implementation. Discussions highlight both the advantages and limitations of various methods, indicating ongoing debate.

Contextual Notes

Some participants note the importance of considering the size of data and the efficiency of hashing algorithms, while others mention the implications of using linked-list structures in hash maps. There is also a discussion about the trade-offs between hash maps and binary trees for specific use cases.

lpetrich
Science Advisor
Messages
998
Reaction score
180
Associative Arrays in C/C++ -- Favorite Strategies?

I mean by that something like what Perl or Python or Ruby have, a.k.a. dictionary types. I can do it in Mathematica with the help of label[key] = value and Head and DownValues, but it's C/C++ where it is the most difficult of all.


I once had the task of searching an array with elements {params, data} -- find data for some params value. I used a speedup that consisted of making a hash table of
(hash value for params) -> (index in data array)

I got around the hash-collision problem by checking the element at the data-array index. If it has the right params value, then I've succeeded, but if not, then I'd search for params in the data array and update the hash table with the index that I found.


Most recently, in my Lie-algebra code, I've used the C++ Standard Template Library container "map". I wanted to see if some calculated root vector was already in some list of them, so I created a map object that contains the listed root vectors. Searching in the map object requires doing a compare operation, but I found a way to speed it up relative to an element-by-element compare.

When I add a root vector to the map object, I calculate and include a hash code for it. When it comes time to compare, I first compare the hash codes, then if they are equal, I do an element-by-element comparison. This strategy produces a rather noticeable speedup.


In summary, do a fast lookup or compare, and if that fails or returns equal, then do a slow but accurate one.


Are these reasonable strategies?
 
Technology news on Phys.org


phiby said:
There is a proposal to add hash_map to the STL - however, I am not sure if it's been accepted or not?

Yep, it's in C++11 as the class template std::unordered_map, which was pretty much taken verbatim from the Boost class template of the same name (but of course it's in namespace Boost in Boost rather than in namespace std).
 


If the data size is small enough and ram size is large enough, you can use data as an index to a huge array of addresses to create an associatve array. Initializing the entire array with "invalid address" entries (use a flag or some invalid address value) is time consuming, but access time is quick.

Otherwise you use some type of hash and/or search algorithm as already mentioned.
 
Last edited:


Thanx. Seems like I'm on the right track.

I've been using OSX Snow Leopard's out-of-the-box gcc, which is 4.2. However, I can get gcc up to 4.6 on the OSX package manager fink.

I now recall that in some recent hash-array code, I'd created an index-chain array, where the value for each index is another index whose data has the same hash value. The end of a chain is marked out with NONE (-1). That speeds up searching compared to my earlier effort.

In both of them, I'd used a fixed-size hash array, and a power of 2 at that. I think that if I made the hash table variable-sized, then I'd have to recalculate it every time I changed its size.I also checked on gcc's implementation of STL's associative containers. Those objects use a red-black binary tree.

For N elements, the search time:
Binary tree: log(N)
Linear-search hash table: N/H (H = hash-table size)

If one creates a hash table with a binary tree for each hash value, the search time would become
log(N/H) = log(N) - log(H)
 


lpetrich said:
If one creates a hash table with a binary tree for each hash value, the search time would become
log(N/H) = log(N) - log(H)

You have to be careful with hash-maps because if the hashing algorithm is not uniform then you will get more collisions in certain areas depending on your input data (keys).

The other thing is that along with the above, your map needs to take into account the number of entries because whenever you do get a collision, what will happen then is that it will traverse usually a linked-list structure to get the actual structure pointer which means that now you need to consider that not only are you now searching in linear Big O time, but since it is a linked-list you will have indirection which means that it will not be as fast as a normal array and that the compiler can not optimize the code due to the nature of your bin structure being a linked list.

Even if you used something like a binary-tree for the internal bin structure, you still have indirection and the point of the hash-map is to reduce the search and lookup time of an object to something resembling O(C) and not O(logn): if you want to use something more resembling the latter, then just use a binary tree and forget about the hash-map.
 


lpetrich said:
Thanx. Seems like I'm on the right track.

I've been using OSX Snow Leopard's out-of-the-box gcc, which is 4.2. However, I can get gcc up to 4.6 on the OSX package manager fink.
If you are using 4.2.1, you can use the TR1 (a C++ evaluation committee) version of unordered_map via #include <tr1/unordered_map>.

Word of warning: There's a good reason Snow Leopard comes with a five year old version of gcc. gcc 4.2.1 was the last version of gcc released under GPL version 2. All subsequent releases, including 4.6, were released under GPL version 3. This shouldn't be a problem if you are a student. If you are working for a living, you might want to think hard about using gcc 4.6. IMHO, stay far, far away from gcc 4.6. Release your application as a binary compiled with gcc 4.6 and boom! your work just because public domain.

You can use fink to get the Boost libraries. Boost was released with a reasonable usage license. The Boost are incredibly powerful. Getting Boost is well worth looking into.
 


You can also just use hash map directly with the GCC you already have. This is what I use, it works fine with the gcc in xcode 3.x:

Code:
#include <ext/hash_map>
using namespace ::__gnu_cxx;
namespace __gnu_cxx {                                                                                             
	template<> struct hash< std::string > // Allow STL strings with hash_map
	{ size_t operator()( const std::string& x ) const { return hash< const char* >()( x.c_str() ); } };          
	template<> struct hash< void * > // Allow pointers with hash_map               
	{ size_t operator()( const void * x ) const { return hash< uintptr_t >()( (uintptr_t)x ); } };          
}

(I'm not totally sure, now that I think about it, what this code will do when you run it in clang.)

I think if you try to install your own GCC on Snow Leopard, you will regret it.
 

Similar threads

  • · Replies 25 ·
Replies
25
Views
3K
  • · Replies 2 ·
Replies
2
Views
2K
  • · Replies 5 ·
Replies
5
Views
2K
Replies
20
Views
2K
  • · Replies 133 ·
5
Replies
133
Views
12K
  • · Replies 29 ·
Replies
29
Views
4K
  • · Replies 17 ·
Replies
17
Views
3K
  • · Replies 5 ·
Replies
5
Views
3K
  • · Replies 2 ·
Replies
2
Views
2K
  • · Replies 2 ·
Replies
2
Views
2K