Why are hash maps constant time?

  • Thread starter: ergospherical
  • Tags: Constant Time

Discussion Overview

The discussion revolves around the time complexity of hash maps, specifically why they are commonly treated as having constant, O(1), time complexity. Participants examine when that assumption holds, including the role of load factors, hash function quality, and practical performance on real hardware. The conversation also broadens into how well traditional algorithmic complexity measures predict performance in different contexts.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • Some participants assert that the assumption of O(1) time complexity for hash maps is based on the independence of the hash function from the size of the hash table, although they acknowledge that this may not hold true when the number of elements exceeds the number of bins.
  • Others argue that the O(1) assumption relies on a perfect hash function, which is difficult to achieve in practice.
  • A participant mentions that the average time complexity for practical applications is often referred to as amortized time complexity, which may differ from worst-case scenarios.
  • Concerns are raised about the relevance of traditional algorithmic complexity measures, suggesting they may not accurately reflect modern hardware performance.
  • Some participants discuss the performance of data structures like vectors and linked lists, debating the conditions under which one may outperform the other in terms of insertion and traversal times.
  • There are claims that the efficiency of data structures can vary significantly based on the size and type of data being processed, highlighting the importance of profiling in performance assessments.

Areas of Agreement / Disagreement

Participants express differing views on the validity of the O(1) assumption for hash maps, with no consensus reached on the implications of load factors or the performance of different data structures. The discussion remains unresolved regarding the accuracy of claims about the efficiency of vectors versus linked lists.

Contextual Notes

Limitations in the discussion include assumptions about the behavior of hashing functions, the impact of data structure choice on performance, and the relevance of benchmarks that may not reflect current hardware capabilities.

Who May Find This Useful

Readers interested in algorithmic complexity, data structure performance, and practical applications of hashing may find this discussion relevant.

ergospherical
When analysing the time-complexity of an algorithm I was told to assume that a hash map is ##0(1)##. A cursory Google search afterward revealed that this is true because the time complexity of the hash function is independent of the size of the hash table. That seems alright, but if you have a finite number of bins ##B##, then if the number of elements ##n > B##, by the pigeonhole principle at least one bin will be mapped to by more than one key - in which case the time complexity would then be ##0(n \log n)##, right? And that's not even taking into account stuff like, say, the hashing function depending in some manner (say linearly) on the length of the key, et cetera. Why does a hash map turn out to be just ##0(1)##?
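(Editorial sketch, not part of the original post: the pigeonhole point can be made concrete by hashing more keys than there are bins and counting bucket occupancy. The bin count, key count, and the use of std::hash<int> below are arbitrary illustrative choices.)

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <iostream>
#include <vector>

// Hash n integer keys into B bins and report the fullest bin, illustrating
// that with n > B at least one bin must receive more than one key.
int main() {
    const std::size_t B = 8;   // number of bins
    const std::size_t n = 20;  // number of keys, deliberately > B
    std::vector<std::size_t> occupancy(B, 0);

    std::hash<int> h;
    for (std::size_t k = 0; k < n; ++k)
        ++occupancy[h(static_cast<int>(k)) % B];

    std::size_t worst = 0;
    for (std::size_t c : occupancy) worst = std::max(worst, c);
    std::cout << "fullest bin holds " << worst << " keys\n";  // always >= ceil(n/B)
}
```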
 
The basic algorithm of a hashing function does not have ##n \gt B##. When ##n \gt B## does occur, there are a variety of methods to handle it, and they may have a variety of execution-time characteristics. That is why you were told to assume that a hash map is ##0(1)##. But that is just an assumption, not necessarily true in practice.
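(Editorial sketch of one such method: separate chaining plus rehashing once the load factor grows past a threshold. The 0.75 threshold, the starting bucket count, and the int-to-int mapping are assumptions made for illustration; real libraries differ in the details.)

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <utility>
#include <vector>

// Toy hash map using separate chaining. When the load factor n/B exceeds a
// threshold, the table doubles its bucket count and rehashes, which is one
// of the "variety of methods" for handling n > B mentioned above.
class ChainedMap {
public:
    explicit ChainedMap(std::size_t buckets = 8) : table_(buckets) {}

    void insert(int key, int value) {
        if (double(size_ + 1) / table_.size() > 0.75) rehash(2 * table_.size());
        auto& chain = table_[index(key)];
        for (auto& kv : chain)
            if (kv.first == key) { kv.second = value; return; }
        chain.emplace_back(key, value);
        ++size_;
    }

    // Lookup walks one chain only: its cost is proportional to the chain
    // length, which stays short on average while the load factor is bounded.
    const int* find(int key) const {
        for (const auto& kv : table_[index(key)])
            if (kv.first == key) return &kv.second;
        return nullptr;
    }

private:
    std::size_t index(int key) const { return std::hash<int>{}(key) % table_.size(); }

    void rehash(std::size_t new_buckets) {
        std::vector<std::list<std::pair<int, int>>> next(new_buckets);
        for (auto& chain : table_)
            for (auto& kv : chain)
                next[std::hash<int>{}(kv.first) % new_buckets].push_back(kv);
        table_ = std::move(next);
    }

    std::vector<std::list<std::pair<int, int>>> table_;
    std::size_t size_ = 0;
};
```

Because a lookup only walks the one chain its key hashes to, the expected cost stays constant as long as rehashing keeps the chains short; the worst case, with every key in one chain, is still linear.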
 
Reactions: pbuk, ergospherical and jim mcnamara
In addition to the load factor (size relative to the number of bins), a complexity of ##O(1)## also assumes a perfect hash function, which is usually not a trivial characteristic to achieve in practice for all use cases.

Also, let me add that the notation is called big-O notation, with O being the letter O and not the digit 0.
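(Editorial note: C++'s std::unordered_map exposes the load-factor machinery mentioned above through its standard interface; it rehashes automatically once the load factor would exceed max_load_factor(). The element count below is arbitrary.)

```cpp
#include <iostream>
#include <unordered_map>

// Inspect the bucket count and load factor of a standard hash map after
// inserting some elements; the container grows its bucket array as needed.
int main() {
    std::unordered_map<int, int> m;
    for (int i = 0; i < 1000; ++i) m[i] = i;

    std::cout << "buckets:         " << m.bucket_count()    << '\n'
              << "load factor:     " << m.load_factor()     << '\n'
              << "max load factor: " << m.max_load_factor() << '\n';
}
```

The container keeps load_factor() at or below max_load_factor() by growing the bucket array, which is what the constant-average claim leans on.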
 
Reactions: Wrichik Basu, FactChecker, ergospherical and 1 other person
Your intuition is right: they don't have a worst-case time complexity of O(1) for lookups; they have an amortized time complexity of O(1).
 
Reactions: ergospherical and pbuk
Welcome to the minefield of algorithmic complexity. There are two important lessons which are generally only presented at the end of an introductory course, if at all.
  1. What we are often interested in for practical applications is how long, on average, it takes to compute a solution for a typical input - this is the amortized time complexity to which @Jarvis323 refers above (although abusing the notation somewhat). Unfortunately, all three of the common measures of algorithmic complexity are based on worst-case run time.
  2. Even when we calculate theoretical algorithmic complexity for a non-trivial algorithm we make assumptions which are only valid within certain limits. For instance we sometimes assume that addition, or even multiplication, of integers is ## \Theta(1) ##. If we are talking about 32- or 64-bit integers then this may be true; however, if we allow the domain to be unbounded then clearly it is not. It gets even worse when we apply this to a practical application: an algorithm which is ## \Theta(n \log n) ## will take time proportional to ## n \log n ## only while the computation fits in L1 cache, and then there will be a discontinuity with a new constant of proportionality for L2 cache, etc.
There is a useful discussion of this in the context of the ## O(1) ## assumption for a hash table at http://www.cs.cornell.edu/courses/cs3110/2011sp/Lectures/lec20-amortized/amortized.htm.
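(Editorial sketch of the amortized argument in those notes: count the element moves performed by a table that doubles whenever it fills. The starting capacity of 8 and the doubling policy are assumptions for illustration, not any particular library's behaviour.)

```cpp
#include <cstddef>
#include <iostream>

// Count how many element moves a doubling table performs while n keys are
// inserted one at a time. The total stays proportional to n, so the amortized
// cost per insert is constant even though any insert that triggers a rehash
// moves every stored element.
int main() {
    const std::size_t n = 1000000;
    std::size_t capacity = 8, size = 0, moves = 0;

    for (std::size_t i = 0; i < n; ++i) {
        if (size == capacity) {  // table full: rehash into a table twice as big
            moves += size;       // every stored key is moved exactly once
            capacity *= 2;
        }
        ++size;
    }
    std::cout << "total moves: " << moves << " for " << n << " inserts ("
              << double(moves) / n << " per insert on average)\n";
}
```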
 
Reactions: jbunniii, Wrichik Basu, FactChecker and 2 others
Time complexity and big-O notation are just approximations. The way we teach them is broken, as it assumes 1960s hardware that doesn't resemble the computers of today. On most machines, O(n) insertion into a vector is often faster than O(1) insertion into a linked list, due to the cache and pipeline.
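(Editorial sketch of how one might test this kind of claim rather than argue it: time a combined "find the sorted position, then insert" loop for both containers. The element count, the use of std::rand, and the linear search are arbitrary choices; absolute numbers depend entirely on the machine.)

```cpp
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <list>
#include <vector>

// Build a sorted container of n random ints by linear search + insert,
// and return the elapsed time in seconds.
template <typename Container>
double sorted_insert_time(int n) {
    Container c;
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        const int v = std::rand();
        auto it = c.begin();
        while (it != c.end() && *it < v) ++it;  // same linear search for both
        c.insert(it, v);                        // O(n) for vector, O(1) for list
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    const int n = 20000;
    std::cout << "vector<int>: " << sorted_insert_time<std::vector<int>>(n) << " s\n"
              << "list<int>:   " << sorted_insert_time<std::list<int>>(n)   << " s\n";
}
```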
 
Reactions: anorlunda
kalabren said:
On most machines, O(n) insertion into a vector is often faster than O(1) insertion into a linked list, due to the cache and pipeline.
I don't think this is correct. Evidence or reference?
 
No, BS is saying that TRAVERSING a vector OF INTEGERS (in order to find the insertion point) is often faster than traversing a linked list, and that this efficiency dominates the theoretically less efficient process of insertion, where FOR INTEGERS the inefficiency is mitigated by cache hits.

In the real world we are hardly ever dealing with lists of integers, and we often have much larger data structures, or entities that cannot be serialized to a fixed-length record at all. The post you linked considers this, showing that for the combined search and insert operation in the context tested, a vector was about twice as fast for 100,000 8-byte records, but for 100,000 128-byte records it was about 50% slower and for 100,000 4K records the vector was more than 20x slower than the list. Where the records are non-trivial data types (so they cannot simply be serialized into a fixed-length array element), again the vector is 20x slower for 100,000 records (the 'random_insert - 16 bytes' chart immediately before the 'Random Remove' section).

So rather than "On most machines, O(n) insertion into a vector is often faster than O(1) insertion into a linked list, due to the cache and pipeline", I think we can say that "On a machine with a cache that is larger than the dataset in question, the combined process of insertion of a record preceded by searching for an insertion point can be significantly faster for a vector than for a list; however, for large datasets, or where moving data is a non-trivial operation, the vector's performance can be many times worse than a list's."
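(Editorial variation of the same experiment with an adjustable record size, which is the knob the quoted benchmark turns. The payload sizes and element count below are arbitrary, and the output is only whatever a given machine produces; this is a sketch for probing the crossover described above, not a reproduction of that benchmark.)

```cpp
#include <array>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <list>
#include <vector>

// A record with an integer key and a payload whose size is a template
// parameter, so the cost of moving elements around can be varied.
template <std::size_t Bytes>
struct Record {
    int key;
    std::array<char, Bytes> payload;
    bool operator<(const Record& other) const { return key < other.key; }
};

// Build a sorted container of n records by linear search + insert and
// return the elapsed time in seconds.
template <typename Container, typename Rec>
double sorted_insert_time(int n) {
    Container c;
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        Rec r{std::rand(), {}};
        auto it = c.begin();
        while (it != c.end() && *it < r) ++it;
        c.insert(it, r);
    }
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    const int n = 2000;
    using Small = Record<8>;     // small payload: moving elements is cheap
    using Large = Record<4096>;  // large payload: moving elements dominates
    std::cout << "vector, small records: " << sorted_insert_time<std::vector<Small>, Small>(n) << " s\n"
              << "list,   small records: " << sorted_insert_time<std::list<Small>, Small>(n)   << " s\n"
              << "vector, large records: " << sorted_insert_time<std::vector<Large>, Large>(n) << " s\n"
              << "list,   large records: " << sorted_insert_time<std::list<Large>, Large>(n)   << " s\n";
}
```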
 
Reactions: anorlunda
The real lesson from this is that profiling is essential if you really care about efficiency, and that we shouldn't treat the textbook as if it were dogma.

As a sidebar, I wouldn't put too much trust in the exact values in that report, since it was run on 2012 hardware. However, it does demonstrate the key point.
 
pbuk said:
I don't think this is correct. Evidence
I think it's correct but may be misleading, because computational complexity is mentioned but isn't really what matters here. Vectors are more compact in memory, access is more predictable, and the hardware takes advantage of that.
 
Vanadium 50 said:
I think it's correct...
Then I am afraid you are wrong. Insertion is almost always slower for vectors, but searching followed by insertion may, in certain circumstances, be quicker for vectors.
 
pbuk said:
wrong

pbuk said:
almost always
I don't think both of these can be true.

All I am saying is you tell me what behavior you want to see, and I'll show you an edge case that does it.
 
kalabren said:
All I am saying is you tell me what behavior you want to see, and I'll show you an edge case that does it.
Yes, of course, but any number of edge cases does not prove an "often", which is the statement I am refuting.
 
