Why are hash maps constant time?

  • #1
ergospherical
When analysing the time-complexity of an algorithm I was told to assume that a hash map is ##0(1)##. A cursory Google search afterward revealed that this is true because the time complexity of the hash function is independent of the size of the hash table. That seems alright, but if you have a finite number of bins ##B##, then if the number of elements ##n > B##, by the pigeonhole principle at least one bin will be mapped to by more than one key - in which case the time complexity would then be ##0(n \log n)##, right? And that's not even taking into account stuff like, say, the hashing function depending in some manner (say linearly) on the length of the key, et cetera. Why does a hash map turn out to be just ##0(1)##?
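For concreteness, here is a minimal sketch (my own illustration, assuming integer keys, a fixed number of bins ##B##, and separate chaining) of exactly the situation described: once ##n \gg B##, every bin holds roughly ##n/B## keys and a lookup has to scan its bin linearly.

```cpp
// A minimal sketch (assuming integer keys, a fixed number of bins B, and
// separate chaining) of what happens when n >> B: every bin holds roughly
// n/B keys and a lookup must scan its bin linearly.
#include <cstddef>
#include <functional>
#include <iostream>
#include <list>
#include <vector>

int main() {
    const std::size_t B = 16;                  // fixed number of bins
    std::vector<std::list<int>> bins(B);

    const int n = 100000;                      // n >> B, so the pigeonhole principle applies
    for (int key = 0; key < n; ++key)
        bins[std::hash<int>{}(key) % B].push_back(key);

    // A lookup hashes the key, then scans one bin of roughly n/B elements.
    auto contains = [&](int key) {
        for (int k : bins[std::hash<int>{}(key) % B])
            if (k == key) return true;
        return false;
    };

    std::cout << "bin 0 holds " << bins[0].size() << " keys; contains(99999) = "
              << std::boolalpha << contains(99999) << '\n';
}
```

With ##B = 16## and ##n = 100000##, each bin holds about 6250 keys, so a lookup does thousands of comparisons rather than a constant number - which is why real implementations grow the table rather than let the load factor run away.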
 
  • #2
The basic hash-table algorithm does not have ##n \gt B##. When ##n \gt B## does occur, there are a variety of methods to handle it, and they may have a variety of execution-time characteristics. That is why you were told to assume that a hash map is ##0(1)##. But that is just an assumption, not necessarily true in practice.
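As a rough sketch of one such method (the class name and the doubling policy here are my own illustrative choices, not taken from any particular library): grow the table and rehash everything whenever the element count is about to exceed the number of bins, so each bin stays short.

```cpp
// A minimal, illustrative chained hash table that doubles its bin count
// whenever the load factor would exceed 1. Not production code: no duplicate
// checking, no erase, int keys only.
#include <cstddef>
#include <functional>
#include <iostream>
#include <list>
#include <utility>
#include <vector>

class ResizingTable {
    std::vector<std::list<int>> bins_;
    std::size_t size_ = 0;

    static std::size_t index(int key, std::size_t nbins) {
        return std::hash<int>{}(key) % nbins;
    }

    void rehash(std::size_t new_count) {
        std::vector<std::list<int>> fresh(new_count);
        for (const auto& bin : bins_)
            for (int key : bin)
                fresh[index(key, new_count)].push_back(key);
        bins_ = std::move(fresh);
    }

public:
    ResizingTable() : bins_(8) {}

    void insert(int key) {
        if (size_ + 1 > bins_.size())      // load factor about to exceed 1
            rehash(bins_.size() * 2);      // doubling keeps the amortized cost constant
        bins_[index(key, bins_.size())].push_back(key);
        ++size_;
    }

    bool contains(int key) const {
        for (int k : bins_[index(key, bins_.size())])
            if (k == key) return true;
        return false;
    }
};

int main() {
    ResizingTable t;
    for (int i = 0; i < 1000; ++i) t.insert(i);
    std::cout << std::boolalpha << t.contains(999) << ' ' << t.contains(5000) << '\n';
}
```

The occasional rehash is expensive, which is where the amortized argument discussed later in the thread comes in.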
 
Last edited:
  • Like
Likes pbuk, ergospherical and jim mcnamara
  • #3
In addition to the load factor (size relative to number of bins), a complexity of ##O(1)## also assumes a perfect hash function, which is usually not a trivial property to achieve in practice for all use cases.

Also, let me add that the notation is called big-O notation, with O being the letter O and not the digit 0.
 
  • Like
Likes Wrichik Basu, FactChecker, ergospherical and 1 other person
  • #4
Your intuition is right: they don't have a worst-case time complexity of O(1) for lookups, they have an amortized time complexity of O(1).
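The "amortized" part is most precise for insertions into a table that doubles when it fills (for lookups, "expected" or "average-case" ##O(1)## is the better term, as the next post notes). A rough sketch of the standard doubling argument, not tied to any particular implementation: ##n## insertions cost at most ##n## units of ordinary work plus ##1 + 2 + 4 + \dots + 2^{k-1} = 2^{k} - 1 < 2n## units of element copying during resizes, where ##2^{k}## is the final capacity and ##2^{k-1} < n \le 2^{k}##. That is fewer than ##3n## units in total, so the average cost per insertion is bounded by a constant, even though the single insertion that triggers a resize costs ##\Theta(n)## by itself.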
 
Last edited:
  • Like
Likes ergospherical and pbuk
  • #5
Welcome to the minefield of algorithmic complexity. There are two important lessons which are generally only presented at the end of an introductory course, if at all.
  1. What we are often interested in for practical applications is how long, on average, it takes to compute a solution for a typical input - this is the amortized time complexity to which @Jarvis323 refers above (although abusing the notation somewhat). Unfortunately all three of the common measures of algorithmic complexity are based on worst-case run time.
  2. Even when we calculate theoretical algorithmic complexity for a non-trivial algorithm we make assumptions which are only valid within certain limits. For instance we sometimes assume that addition, or even multiplication, of integers is ## \Theta(1) ##. If we are talking about 32- or 64-bit integers then this may be true; however, if we allow the domain to be unbounded then clearly it is not. It gets even worse when we apply this to a practical application: an algorithm which is ## \Theta(n \log n) ## will take time proportional to ## n \log n ## only while the computation fits in L1 cache, and then there will be a discontinuity with a new constant of proportionality for L2 cache, etc. (the timing sketch after the link below illustrates this).
There is a useful discussion of this in the context of the ## O(1) ## assumption for a hash table at http://www.cs.cornell.edu/courses/cs3110/2011sp/Lectures/lec20-amortized/amortized.htm.
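To make point 2 concrete, here is a rough timing sketch (my own, not a rigorous benchmark): it streams over buffers of increasing size and prints the time per element. On typical hardware the per-element time steps upward once the working set no longer fits in L1/L2/L3, which is the "new constant of proportionality" described above; the exact shape depends on the machine and its prefetchers.

```cpp
// A rough timing sketch (not a rigorous benchmark): stream over buffers of
// increasing size and print the time per element. The per-element cost
// typically steps up once the working set no longer fits in cache.
#include <chrono>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    for (std::size_t bytes = 1u << 14; bytes <= (1u << 27); bytes <<= 2) {
        std::vector<int> data(bytes / sizeof(int), 1);

        long long checksum = 0;
        const int repeats = 20;
        auto start = std::chrono::steady_clock::now();
        for (int r = 0; r < repeats; ++r)
            checksum += std::accumulate(data.begin(), data.end(), 0LL);
        auto stop = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double, std::nano>(stop - start).count();
        std::cout << bytes / 1024 << " KiB: "
                  << ns / (repeats * static_cast<double>(data.size()))
                  << " ns/element (checksum " << checksum << ")\n";
    }
}
```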
 
  • Like
  • Informative
  • Love
Likes jbunniii, Wrichik Basu, FactChecker and 2 others
  • #6
Time complexity and big-O notation are just approximations. The way we teach them is broken: it assumes 1960s hardware that doesn't resemble the computers of today. On most machines, O(n) insertion into a vector is often faster than O(1) insertion into a linked list, due to the cache and pipeline.
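For anyone who wants to try the comparison, here is a minimal sketch (my own, not a careful benchmark): both containers are kept sorted, and each new value is inserted by traversing to its position. The vector pays O(n) to shift elements but traverses contiguous memory; the list inserts in O(1) once the position is found but chases pointers to find it.

```cpp
// A minimal sketch (not a rigorous benchmark): keep a std::vector<int> and a
// std::list<int> sorted, and for each new value traverse to the insertion
// point and insert there. The vector's insert shifts elements (O(n)) but its
// traversal streams through contiguous memory; the list's insert is O(1) but
// its traversal chases pointers around the heap.
#include <chrono>
#include <iostream>
#include <list>
#include <random>
#include <vector>

template <typename Container>
double timed_sorted_inserts(Container& c, const std::vector<int>& values) {
    auto start = std::chrono::steady_clock::now();
    for (int v : values) {
        auto it = c.begin();
        while (it != c.end() && *it < v) ++it;   // linear search for the spot
        c.insert(it, v);
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    const int n = 20000;
    std::mt19937 rng(12345);
    std::vector<int> values(n);
    for (int& v : values) v = static_cast<int>(rng());

    std::vector<int> vec;
    std::list<int> lst;
    std::cout << "vector: " << timed_sorted_inserts(vec, values) << " ms, "
              << "list: "   << timed_sorted_inserts(lst, values) << " ms\n";
}
```

On the hardware I would expect, the vector tends to win this particular race for plain ints; the later posts discuss when it doesn't.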
 
  • Informative
Likes anorlunda
  • #7
kalabren said:
On most machines, O(n) insertion into a vector is often faster than O(1) insertion into a linked list, due to the cache and pipeline.
I don't think this is correct. Evidence or reference?
 
  • #9
No, BS (Bjarne Stroustrup) is saying that TRAVERSING a vector OF INTEGERS (in order to find the insertion point) is often faster than traversing a linked list, and that this efficiency dominates the theoretically less efficient process of insertion, where FOR INTEGERS the inefficiency is mitigated by cache hits.

In the real world we are hardly ever dealing with lists of integers, and we often have much larger data structures, or entities that cannot be serialized to a fixed-length record at all. The post you linked considers this, showing that for the combined search and insert operation in the context tested, a vector was about twice as fast for 100,000 8-byte records, but for 100,000 128-byte records it was about 50% slower and for 100,000 4K records the vector was more than 20x slower than the list. Where the records are non-trivial data types (so they cannot simply be serialized into a fixed-length array element), again the vector is 20x slower for 100,000 records (the 'random_insert - 16 bytes' chart immediately before the 'Random Remove' section).

So rather than "On most machines, O(n) insertion into a vector is often faster than O(1) insertion into a linked list, due to the cache and pipeline", I think we can say that "On a machine with a cache that is larger than the dataset in question, the combined process of searching for an insertion point and then inserting a record can be significantly faster for a vector than for a list; however, for large datasets, or where moving data is a non-trivial operation, the vector's performance can be many times worse than the list's."
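To see that flip in the sketch posted under #6 above, it is enough to swap the element type for something bulky. The Record type and the 4 KB payload below are illustrative choices only, and the numbers will not match the 2012 report.

```cpp
// A variant of the earlier sketch with a 4 KB record instead of an int, to
// show how the comparison can flip when moving an element is expensive.
#include <chrono>
#include <cstddef>
#include <iostream>
#include <list>
#include <random>
#include <vector>

constexpr std::size_t PAYLOAD = 4096;

struct Record {
    int key;
    char payload[PAYLOAD];   // makes copying/shifting a Record expensive
};

template <typename Container>
double timed_sorted_inserts(Container& c, const std::vector<int>& keys) {
    auto start = std::chrono::steady_clock::now();
    for (int k : keys) {
        auto it = c.begin();
        while (it != c.end() && it->key < k) ++it;   // linear search for the spot
        c.insert(it, Record{k, {}});
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    const int n = 1000;
    std::mt19937 rng(42);
    std::vector<int> keys(n);
    for (int& k : keys) k = static_cast<int>(rng());

    std::vector<Record> vec;
    std::list<Record> lst;
    std::cout << "vector: " << timed_sorted_inserts(vec, keys) << " ms, "
              << "list: "   << timed_sorted_inserts(lst, keys) << " ms\n";
}
```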
 
Last edited:
  • Like
Likes anorlunda
  • #10
The real lesson from this is that profiling is essential if you really care about efficiency, and that we shouldn't trust the textbook as if it were dogma.

As a sidebar, I wouldn't put too much trust in the exact values in that report, since it was run on 2012 hardware. However, it does demonstrate the key point.
 
  • #11
pbuk said:
I don't think this is correct. Evidence
I think it's correct but may be misleading, because computational complexity is mentioned but not relevant. Vectors are more compact, their access pattern is more predictable, and the hardware takes advantage of that.
 
  • #12
Vanadium 50 said:
I think it's correct...
Then I am afraid you are wrong. Insertion is almost always slower for vectors but searching followed by insertion may in certain circumstances be quicker for vectors.
 
  • #13
pbuk said:
wrong

pbuk said:
almost always
I don't think both of these can be true.

All I am saying is you tell me what behavior you want to see, and I'll show you an edge case that does it.
 
  • #14
kalabren said:
All I am saying is you tell me what behavior you want to see, and I'll show you an edge case that does it.
Yes of course, but any number of edge cases do not prove an "often", which is the statement I am refuting.
 

1. Why are hash maps considered to be constant time?

Hash maps are considered constant time because the expected time to retrieve a value does not grow with the number of elements stored, as long as the table is resized so that the load factor (elements per bucket) stays bounded. A lookup hashes the key, jumps straight to one bucket, and inspects only the few entries there. This allows for efficient retrieval of values, making hash maps a popular data structure for storing and retrieving data.

2. How does a hash map achieve constant time lookup?

A hash map achieves constant time lookup by using a hash function to map keys to specific indexes in an array. This allows for direct access to the value associated with a key, without having to iterate through the entire map. As long as the hash function is well-designed and the array is large enough, the time taken to retrieve a value will remain constant regardless of the size of the map.
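A tiny sketch of that key-to-index step in C++ (the key "cat" and the table size 16 are arbitrary choices for illustration):

```cpp
// Hash the key, then reduce it modulo the table size to get a bucket index.
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>

int main() {
    const std::size_t table_size = 16;
    const std::string key = "cat";
    const std::size_t index = std::hash<std::string>{}(key) % table_size;
    std::cout << '"' << key << "\" -> bucket " << index << '\n';
}
```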

3. Can hash maps ever have a lookup time that is not constant?

In theory, yes. A hash map's lookup time can degrade if the hash function is poorly designed and produces a high number of collisions, or if the table is not enlarged as values are added, so that buckets grow long. In practice, however, hash maps are designed to minimize the chances of these scenarios, making their lookup time effectively constant on average.

4. Are there any disadvantages to using hash maps?

While hash maps have many advantages, there are also some disadvantages to using them. One major disadvantage is that most implementations do not preserve the order of elements, so the order in which elements are inserted into the map may not be the order in which they are iterated. Additionally, hash maps can use a significant amount of memory, since they typically keep many buckets empty (a low load factor) in order to keep collisions rare.

5. Can hash maps be used for all types of data?

Hash maps can be used with almost any type of data, as long as the data can serve as a key: it must be hashable and support equality comparison. That typically includes primitive types like integers and booleans, as well as strings and custom objects that implement the necessary hashing and equality methods. Some programming languages also provide built-in support for more complex keys, such as tuples or other immutable composite values.
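For example, in C++ a user-defined type becomes usable as an unordered_map key once it has equality comparison and a std::hash specialization. A minimal sketch - the Point type and the hash-combining formula are illustrative choices, not a canonical recipe:

```cpp
// Make a custom type usable as a hash-map key: define operator== and
// specialize std::hash for it.
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>

struct Point {
    int x, y;
    bool operator==(const Point& other) const { return x == other.x && y == other.y; }
};

namespace std {
template <>
struct hash<Point> {
    size_t operator()(const Point& p) const noexcept {
        size_t hx = hash<int>{}(p.x);
        size_t hy = hash<int>{}(p.y);
        return hx ^ (hy + 0x9e3779b9u + (hx << 6) + (hx >> 2));  // simple hash combine
    }
};
}  // namespace std

int main() {
    std::unordered_map<Point, std::string> label;
    label[{1, 2}] = "a point";
    std::cout << label[{1, 2}] << '\n';
}
```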
