Expected number of collisions for hashing with chaining

Click For Summary
SUMMARY

The expected number of collisions in a hash table using chaining can be calculated using the formula 𝓔[Y] = n(n-1)/(2m), where n is the number of keys and m is the size of the hash table. The discussion highlights the importance of understanding the difference between the expected number of collisions for a single key and the total number of collisions across all keys. The incorrect reasoning in the initial attempt stemmed from miscalculating the total number of pairs of keys and not properly accounting for the uniform distribution of hash values.

PREREQUISITES
  • Understanding of hash functions and their properties
  • Familiarity with probability theory, particularly random variables
  • Knowledge of combinatorial mathematics, specifically combinations
  • Basic concepts of hashing techniques, including chaining
NEXT STEPS
  • Study the principles of simple uniform hashing and its implications on collision probability
  • Learn about different collision resolution techniques in hash tables, such as open addressing
  • Explore advanced hashing algorithms and their performance metrics
  • Investigate the mathematical foundations of expected values in probability theory
USEFUL FOR

Mathematicians, computer scientists, software engineers, and anyone involved in data structure optimization and performance analysis of hash tables.

CGandC
Messages
326
Reaction score
34
Homework Statement
Suppose we use a hash function ## h ## to hash ## n ## distinct keys into an array ## T ## of length ## m ##. Assuming simple uniform hashing, what is the expected number of collisions? More precisely, what is the expected cardinality of ## \{ \{ k ,l \} : k \neq l and h(k) = h(l) \} ## ?
Relevant Equations
- Knowledge of random variables, indicator random variables.
- Simple uniform hashing means that the probability that every two different keys hash to the same cell is the same, i.e. ## \mathbb{P}( h(k_i) = h(k_j) ) = \frac{1}{m} ##
Attempt:
Denote ## T## as the hash table, ## h ## as the hash function.
Denote ## n ## as the number of keys in the universe, and ## m ## as the size of hash table.

Order the set of keys from the universe as ## \{ k_1 ,..., k_n \} ## such that ## k_i \leq k_j ## where ## i \leq j ##.
Note that a collision occurs when we try to insert some key ## k ## into the table and it occurs that ## T[h(k)] \neq NULL ##.

Let ## X ## be a random variable denoting a key ( ## X(\omega) = \omega ## , ## \omega \in \Omega ## is a key ).
Let ## C(X) ## be a random variable denoting the number of collisions with ## X ##.

Let ## X_{ij} ## be an indicator random variable denoting the event that a collision between ## k_i ## and ## k_j ## occurred, i.e. whether ## h(k_i) = h(k_j) ## occurred or not.

Thus, ## \mathbb{E}[ C(X)] = \sum_{i=1}^n \mathbb{E}[ C(X) | X = i] \mathbb{P}(X=i) ## ## = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[ C(X) | X = i] ##
( where ## \mathbb{P}(X=i) = \frac{1}{n} ## under the assumption that the element being collided is equally likely to be any of the ## n ## elements stored in the table ).

Note that ## C(X) | X=i ## is a random variable and ## C(X) | X=i ~~~~ = \sum_{i \neq j = 1 }^n X_{ij} ##.

Thus, ## \mathbb{E}[ C(X) | X=i] = \sum_{i \neq j = 1 }^n \mathbb{E}[ X_{ij}] = \sum_{i \neq j = 1 }^n \mathbb{P}( h(k_i) = h(k_j) ) ## ## = \sum_{i \neq j = 1 }^n \frac{1}{m} = \frac{n-1}{m} ##

* The sum went over all possible values of ## j ## besides the case where ## j = i ##, hence the ## n-1 ## in the sum .
* ## \mathbb{P}( h(k_i) = h(k_j) ) = \frac{1}{m} ## because of the assumption of simple uniform hashing .

Thus, ## \frac{1}{n} \sum_{i=1}^n \mathbb{E}[ C(X) | X = i] = \frac{1}{n} \cdot \frac{n(n-1)}{m} = \frac{n-1}{m} ##.
Hence, ## \mathbb{E}[ C(X)] = \frac{n-1}{m} ##, which is what we wanted.

Correct answer:
Denote as ## Y ## a random variable representing total number of collisions.
Note that ## Y= \sum_{i \neq j }^n X_{ij} ##.
Thus, ## \mathbb{E}[ Y] =\sum_{i \neq j }^n \mathbb{E}[ X_{ij}] = {n \choose 2} \cdot \frac{1}{m} = \frac{n(n-1)}{2m} ##

Question:
Obviously my answer is not correct since ## \frac{n-1}{m} \neq \frac{n(n-1)}{2m} ##, and I understood the correct answer and I agree with it, but I don't understand why my lines of reasoning in the attempt didn't yield the correct answer, what was wrong in my attempt that led to incorrect result?

Thanks in advance for any help!
 
Physics news on Phys.org
CGandC said:
what was wrong in my attempt that led to incorrect result?

You disappeared off into the woods after

CGandC said:
Let ## X_{ij} ## be an indicator random variable denoting the event that a collision between ## k_i ## and ## k_j ## occurred, i.e. whether ## h(k_i) = h(k_j) ## occurred or not.

Clearly ## P(X_{ij} = 1) = \dfrac 1 m ##, so all we have to do is count the number of ## X_{ij} ##s and remember not to count the same collision twice.
 

Similar threads

  • · Replies 6 ·
Replies
6
Views
2K
  • · Replies 1 ·
Replies
1
Views
2K
  • · Replies 6 ·
Replies
6
Views
3K
  • · Replies 3 ·
Replies
3
Views
3K
Replies
4
Views
2K
Replies
0
Views
892
  • · Replies 9 ·
Replies
9
Views
3K
  • · Replies 5 ·
Replies
5
Views
3K
Replies
12
Views
3K
  • · Replies 1 ·
Replies
1
Views
2K