Why are hash maps constant time?

  • Thread starter: ergospherical
  • Tags: Constant Time

Discussion Overview

The discussion revolves around the time complexity of hash maps, specifically why they are commonly treated as having constant, O(1), time complexity. Participants examine when that assumption holds, including the role of load factors, hash function quality, and practical performance on real hardware. The conversation also broadens into how well traditional algorithmic complexity measures predict performance in different contexts.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • Some participants assert that the assumption of O(1) time complexity for hash maps is based on the independence of the hash function from the size of the hash table, although they acknowledge that this may not hold true when the number of elements exceeds the number of bins.
  • Others argue that the O(1) assumption relies on a perfect hash function, which is difficult to achieve in practice.
  • A participant mentions that the average time complexity for practical applications is often referred to as amortized time complexity, which may differ from worst-case scenarios.
  • Concerns are raised about the relevance of traditional algorithmic complexity measures, suggesting they may not accurately reflect modern hardware performance.
  • Some participants discuss the performance of data structures like vectors and linked lists, debating the conditions under which one may outperform the other in terms of insertion and traversal times.
  • There are claims that the efficiency of data structures can vary significantly based on the size and type of data being processed, highlighting the importance of profiling in performance assessments.

Areas of Agreement / Disagreement

Participants express differing views on the validity of the O(1) assumption for hash maps, with no consensus reached on the implications of load factors or the performance of different data structures. The discussion remains unresolved regarding the accuracy of claims about the efficiency of vectors versus linked lists.

Contextual Notes

Limitations in the discussion include assumptions about the behavior of hashing functions, the impact of data structure choice on performance, and the relevance of benchmarks that may not reflect current hardware capabilities.

Who May Find This Useful

Readers interested in algorithmic complexity, data structure performance, and practical applications of hashing may find this discussion relevant.

ergospherical
When analysing the time-complexity of an algorithm I was told to assume that a hash map is ##0(1)##. A cursory Google search afterward revealed that this is true because the time complexity of the hash function is independent of the size of the hash table. That seems alright, but if you have a finite number of bins ##B##, then if the number of elements ##n > B##, by the pigeonhole principle at least one bin will be mapped to by more than one key - in which case the time complexity would then be ##0(n \log n)##, right? And that's not even taking into account stuff like, say, the hashing function depending in some manner (say linearly) on the length of the key, et cetera. Why does a hash map turn out to be just ##0(1)##?
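(Editorial sketch, not part of the original post: the pigeonhole point can be made concrete by hashing more keys than there are bins and counting bucket occupancy. The bin count, key count, and the use of std::hash<int> below are arbitrary illustrative choices.)

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <iostream>
#include <vector>

// Hash n integer keys into B bins and report the fullest bin, illustrating
// that with n > B at least one bin must receive more than one key.
int main() {
    const std::size_t B = 8;   // number of bins
    const std::size_t n = 20;  // number of keys, deliberately > B
    std::vector<std::size_t> occupancy(B, 0);

    std::hash<int> h;
    for (std::size_t k = 0; k < n; ++k)
        ++occupancy[h(static_cast<int>(k)) % B];

    std::size_t worst = 0;
    for (std::size_t c : occupancy) worst = std::max(worst, c);
    std::cout << "fullest bin holds " << worst << " keys\n";  // always >= ceil(n/B)
}
```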
 
The basic algorithm of a hashing function does not have ##n \gt B##. When ##n \gt B## does occur, there are a variety of methods to handle it, and they may have a variety of execution-time characteristics. That is why you were told to assume that a hash map is ##0(1)##. But that is just an assumption, not necessarily true in practice.
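(Editorial sketch of one such method: separate chaining plus rehashing once the load factor grows past a threshold. The 0.75 threshold, the starting bucket count, and the int-to-int mapping are assumptions made for illustration; real libraries differ in the details.)

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <utility>
#include <vector>

// Toy hash map using separate chaining. When the load factor n/B exceeds a
// threshold, the table doubles its bucket count and rehashes, which is one
// of the "variety of methods" for handling n > B mentioned above.
class ChainedMap {
public:
    explicit ChainedMap(std::size_t buckets = 8) : table_(buckets) {}

    void insert(int key, int value) {
        if (double(size_ + 1) / table_.size() > 0.75) rehash(2 * table_.size());
        auto& chain = table_[index(key)];
        for (auto& kv : chain)
            if (kv.first == key) { kv.second = value; return; }
        chain.emplace_back(key, value);
        ++size_;
    }

    // Lookup walks one chain only: its cost is proportional to the chain
    // length, which stays short on average while the load factor is bounded.
    const int* find(int key) const {
        for (const auto& kv : table_[index(key)])
            if (kv.first == key) return &kv.second;
        return nullptr;
    }

private:
    std::size_t index(int key) const { return std::hash<int>{}(key) % table_.size(); }

    void rehash(std::size_t new_buckets) {
        std::vector<std::list<std::pair<int, int>>> next(new_buckets);
        for (auto& chain : table_)
            for (auto& kv : chain)
                next[std::hash<int>{}(kv.first) % new_buckets].push_back(kv);
        table_ = std::move(next);
    }

    std::vector<std::list<std::pair<int, int>>> table_;
    std::size_t size_ = 0;
};
```

Because a lookup only walks the one chain its key hashes to, the expected cost stays constant as long as rehashing keeps the chains short; the worst case, with every key in one chain, is still linear.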
 
Reactions: pbuk, ergospherical and jim mcnamara
In addition to the load factor (size relative to the number of bins), a complexity of ##O(1)## also assumes a perfect hash function, which is usually not a trivial characteristic to achieve in practice for all use cases.

Also, let me add that the notation is called big-O notation, with O being the letter O and not the digit 0.
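(Editorial note: C++'s std::unordered_map exposes the load-factor machinery mentioned above through its standard interface; it rehashes automatically once the load factor would exceed max_load_factor(). The element count below is arbitrary.)

```cpp
#include <iostream>
#include <unordered_map>

// Inspect the bucket count and load factor of a standard hash map after
// inserting some elements; the container grows its bucket array as needed.
int main() {
    std::unordered_map<int, int> m;
    for (int i = 0; i < 1000; ++i) m[i] = i;

    std::cout << "buckets:         " << m.bucket_count()    << '\n'
              << "load factor:     " << m.load_factor()     << '\n'
              << "max load factor: " << m.max_load_factor() << '\n';
}
```

The container keeps load_factor() at or below max_load_factor() by growing the bucket array, which is what the constant-average claim leans on.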
 
Reactions: Wrichik Basu, FactChecker, ergospherical and 1 other person
Your intuition is right: they don't have a worst-case time complexity of O(1) for lookups; they have an amortized time complexity of O(1).
 
Reactions: ergospherical and pbuk
Welcome to the minefield of algorithmic complexity. There are two important lessons which are generally only presented at the end of an introductory course, if at all.
  1. What we are often interested in for practical applications is how long, on average, it takes to compute a solution for a typical input - this is the amortized time complexity to which @Jarvis323 refers above (although abusing the notation somewhat). Unfortunately, all three of the common measures of algorithmic complexity are based on worst-case run time.
  2. Even when we calculate theoretical algorithmic complexity for a non-trivial algorithm we make assumptions which are only valid within certain limits. For instance we sometimes assume that addition, or even multiplication, of integers is ## \Theta(1) ##. If we are talking about 32- or 64-bit integers then this may be true; however, if we allow the domain to be unbounded then clearly it is not. It gets even worse when we apply this to a practical application: an algorithm which is ## \Theta(n \log n) ## will take time proportional to ## n \log n ## only while the computation fits in L1 cache, and then there will be a discontinuity with a new constant of proportionality for L2 cache, etc.
There is a useful discussion of this in the context of the ## O(1) ## assumption for a hash table at http://www.cs.cornell.edu/courses/cs3110/2011sp/Lectures/lec20-amortized/amortized.htm.
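(Editorial sketch of the amortized argument in those notes: count the element moves performed by a table that doubles whenever it fills. The starting capacity of 8 and the doubling policy are assumptions for illustration, not any particular library's behaviour.)

```cpp
#include <cstddef>
#include <iostream>

// Count how many element moves a doubling table performs while n keys are
// inserted one at a time. The total stays proportional to n, so the amortized
// cost per insert is constant even though any insert that triggers a rehash
// moves every stored element.
int main() {
    const std::size_t n = 1000000;
    std::size_t capacity = 8, size = 0, moves = 0;

    for (std::size_t i = 0; i < n; ++i) {
        if (size == capacity) {  // table full: rehash into a table twice as big
            moves += size;       // every stored key is moved exactly once
            capacity *= 2;
        }
        ++size;
    }
    std::cout << "total moves: " << moves << " for " << n << " inserts ("
              << double(moves) / n << " per insert on average)\n";
}
```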
 
Reactions: jbunniii, Wrichik Basu, FactChecker and 2 others
Time complexity and big-O notation are just approximations. The way we teach them is broken, as it assumes 1960s hardware that doesn't resemble the computers of today. On most machines, O(n) insertion into a vector is often faster than O(1) insertion into a linked list, due to the cache and pipeline.
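(Editorial sketch of how one might test this kind of claim rather than argue it: time a combined "find the sorted position, then insert" loop for both containers. The element count, the use of std::rand, and the linear search are arbitrary choices; absolute numbers depend entirely on the machine.)

```cpp
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <list>
#include <vector>

// Build a sorted container of n random ints by linear search + insert,
// and return the elapsed time in seconds.
template <typename Container>
double sorted_insert_time(int n) {
    Container c;
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        const int v = std::rand();
        auto it = c.begin();
        while (it != c.end() && *it < v) ++it;  // same linear search for both
        c.insert(it, v);                        // O(n) for vector, O(1) for list
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    const int n = 20000;
    std::cout << "vector<int>: " << sorted_insert_time<std::vector<int>>(n) << " s\n"
              << "list<int>:   " << sorted_insert_time<std::list<int>>(n)   << " s\n";
}
```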
 
Reactions: anorlunda
kalabren said:
On most machines, O(n) insertion into a vector is often faster than O(1) insertion into a linked list, due to the cache and pipeline.
I don't think this is correct. Evidence or reference?
 
No, BS is saying that TRAVERSING a vector OF INTEGERS (in order to find the insertion point) is often faster than traversing a linked list, and that this efficiency dominates the theoretically less efficient process of insertion, where FOR INTEGERS the inefficiency is mitigated by cache hits.

In the real world we are hardly ever dealing with lists of integers, and we often have much larger data structures, or entities that cannot be serialized to a fixed-length record at all. The post you linked considers this, showing that for the combined search and insert operation in the context tested, a vector was about twice as fast for 100,000 8-byte records, but for 100,000 128-byte records it was about 50% slower and for 100,000 4K records the vector was more than 20x slower than the list. Where the records are non-trivial data types (so they cannot simply be serialized into a fixed-length array element), again the vector is 20x slower for 100,000 records (the 'random_insert - 16 bytes' chart immediately before the 'Random Remove' section).

So rather than "On most machines, O(n) insertion into a vector is often faster than O(1) insertion into a linked list, due to the cache and pipeline", I think we can say that "On a machine with a cache that is larger than the dataset in question, the combined process of insertion of a record preceded by searching for an insertion point can be significantly faster for a vector than for a list; however, for large datasets, or where moving data is a non-trivial operation, the vector's performance can be many times worse than a list's."
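(Editorial variation of the same experiment with an adjustable record size, which is the knob the quoted benchmark turns. The payload sizes and element count below are arbitrary, and the output is only whatever a given machine produces; this is a sketch for probing the crossover described above, not a reproduction of that benchmark.)

```cpp
#include <array>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <list>
#include <vector>

// A record with an integer key and a payload whose size is a template
// parameter, so the cost of moving elements around can be varied.
template <std::size_t Bytes>
struct Record {
    int key;
    std::array<char, Bytes> payload;
    bool operator<(const Record& other) const { return key < other.key; }
};

// Build a sorted container of n records by linear search + insert and
// return the elapsed time in seconds.
template <typename Container, typename Rec>
double sorted_insert_time(int n) {
    Container c;
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        Rec r{std::rand(), {}};
        auto it = c.begin();
        while (it != c.end() && *it < r) ++it;
        c.insert(it, r);
    }
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    const int n = 2000;
    using Small = Record<8>;     // small payload: moving elements is cheap
    using Large = Record<4096>;  // large payload: moving elements dominates
    std::cout << "vector, small records: " << sorted_insert_time<std::vector<Small>, Small>(n) << " s\n"
              << "list,   small records: " << sorted_insert_time<std::list<Small>, Small>(n)   << " s\n"
              << "vector, large records: " << sorted_insert_time<std::vector<Large>, Large>(n) << " s\n"
              << "list,   large records: " << sorted_insert_time<std::list<Large>, Large>(n)   << " s\n";
}
```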
 
Reactions: anorlunda
The real lesson from this is that profiling is essential if you really care about efficiency, and that we shouldn't treat the textbook as if it were dogma.

As a sidebar, I wouldn't put too much trust in the exact values in that report, since it was run on 2012 hardware. However, it does demonstrate the key point.
 
pbuk said:
I don't think this is correct. Evidence
I think it's correct but may be misleading, because computational complexity is mentioned but isn't really what matters here. Vectors are more compact in memory, access is more predictable, and the hardware takes advantage of that.
 
Vanadium 50 said:
I think it's correct...
Then I am afraid you are wrong. Insertion is almost always slower for vectors, but searching followed by insertion may, in certain circumstances, be quicker for vectors.
 
pbuk said:
wrong

pbuk said:
almost always
I don't think both of these can be true.

All I am saying is you tell me what behavior you want to see, and I'll show you an edge case that does it.
 
kalabren said:
All I am saying is you tell me what behavior you want to see, and I'll show you an edge case that does it.
Yes, of course, but any number of edge cases does not prove an "often", which is the statement I am refuting.
 
