Comp Sci Expected number of collisions for hashing with chaining

AI Thread Summary
The discussion focuses on calculating the expected number of collisions in a hash table using chaining. The initial attempt incorrectly concluded that the expected number of collisions is given by the formula E[C(X)] = (n-1)/m, while the correct formula is E[Y] = n(n-1)/(2m). The error arose from not properly accounting for the number of unique pairs of keys that can collide, which is represented by the combination formula. The correct approach involves summing the probabilities of collisions across all unique pairs, leading to the accurate expectation of collisions. Understanding the distinction between counting individual collisions and unique pairs is crucial for deriving the correct result.
CGandC
Homework Statement
Suppose we use a hash function ## h ## to hash ## n ## distinct keys into an array ## T ## of length ## m ##. Assuming simple uniform hashing, what is the expected number of collisions? More precisely, what is the expected cardinality of ## \{ \{ k, l \} : k \neq l \text{ and } h(k) = h(l) \} ## ?
Relevant Equations
- Knowledge of random variables, indicator random variables.
- Simple uniform hashing means that each key is equally likely to hash to any of the ## m ## slots, independently of the other keys; in particular, for any two distinct keys, ## \mathbb{P}( h(k_i) = h(k_j) ) = \frac{1}{m} ## (a quick check of this value is sketched below).
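
As a sanity check on that value (a sketch under the usual reading of simple uniform hashing, in which each key lands in each of the ## m ## slots with probability ## \frac{1}{m} ##, independently of the other keys), conditioning on the slot of ## k_i ## gives
$$ \mathbb{P}\big( h(k_i) = h(k_j) \big) = \sum_{s=1}^{m} \mathbb{P}\big( h(k_i) = s \big)\, \mathbb{P}\big( h(k_j) = s \big) = \sum_{s=1}^{m} \frac{1}{m} \cdot \frac{1}{m} = \frac{1}{m}. $$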
Attempt:
Denote ## T## as the hash table, ## h ## as the hash function.
Denote ## n ## as the number of distinct keys being hashed, and ## m ## as the size of the hash table.

Order the keys as ## \{ k_1 ,..., k_n \} ## such that ## k_i \leq k_j ## whenever ## i \leq j ##.
Note that a collision occurs when we try to insert some key ## k ## into the table and find that ## T[h(k)] \neq \text{NULL} ##.

Let ## X ## be a random variable denoting a key ( ## X(\omega) = \omega ## , ## \omega \in \Omega ## is a key ).
Let ## C(X) ## be a random variable denoting the number of collisions with ## X ##.

Let ## X_{ij} ## be an indicator random variable for the event that a collision between ## k_i ## and ## k_j ## occurred, i.e. indicating whether ## h(k_i) = h(k_j) ## or not.

Thus, ## \mathbb{E}[ C(X)] = \sum_{i=1}^n \mathbb{E}[ C(X) | X = i] \mathbb{P}(X=i) ## ## = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[ C(X) | X = i] ##
( where ## \mathbb{P}(X=i) = \frac{1}{n} ## under the assumption that the element being collided is equally likely to be any of the ## n ## elements stored in the table ).

Note that ## C(X) \mid X=i ## is a random variable and ## C(X) \mid X=i = \sum_{j=1, j \neq i}^n X_{ij} ##.

Thus, ## \mathbb{E}[ C(X) \mid X=i] = \sum_{j=1, j \neq i}^n \mathbb{E}[ X_{ij}] = \sum_{j=1, j \neq i}^n \mathbb{P}( h(k_i) = h(k_j) ) ## ## = \sum_{j=1, j \neq i}^n \frac{1}{m} = \frac{n-1}{m} ##

* The sum ranges over all values of ## j ## except ## j = i ##, hence the ## n-1 ## terms in the sum.
* ## \mathbb{P}( h(k_i) = h(k_j) ) = \frac{1}{m} ## by the assumption of simple uniform hashing.

Thus, ## \frac{1}{n} \sum_{i=1}^n \mathbb{E}[ C(X) | X = i] = \frac{1}{n} \cdot \frac{n(n-1)}{m} = \frac{n-1}{m} ##.
Hence, ## \mathbb{E}[ C(X)] = \frac{n-1}{m} ##, which is what we wanted.

Correct answer:
Denote by ## Y ## a random variable representing the total number of collisions, i.e. the number of colliding (unordered) pairs.
Note that ## Y = \sum_{1 \le i < j \le n} X_{ij} ##.
Thus, ## \mathbb{E}[ Y] = \sum_{1 \le i < j \le n} \mathbb{E}[ X_{ij}] = {n \choose 2} \cdot \frac{1}{m} = \frac{n(n-1)}{2m} ##
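
As a quick numerical check of this formula, here is a minimal Monte Carlo sketch (Python; independent, uniformly random slot assignment stands in for simple uniform hashing, and the function name `simulate_expected_collisions` together with the values ## n = 20 ##, ## m = 50 ## are just illustrative):

```python
import random

def simulate_expected_collisions(n, m, trials=10_000):
    """Estimate the expected number of colliding (unordered) pairs when
    n distinct keys are hashed into m slots, modelling simple uniform
    hashing by independent, uniformly random slot choices (an assumption
    of this sketch)."""
    total_pairs = 0
    for _ in range(trials):
        counts = [0] * m
        for _ in range(n):
            counts[random.randrange(m)] += 1   # slot chosen uniformly at random
        # Every pair of keys sharing a slot is one collision:
        total_pairs += sum(c * (c - 1) // 2 for c in counts)
    return total_pairs / trials

n, m = 20, 50
print(simulate_expected_collisions(n, m))  # empirical estimate, fluctuates around 3.8
print(n * (n - 1) / (2 * m))               # theoretical value n(n-1)/(2m) = 3.8
```

The empirical average should settle near ## \frac{n(n-1)}{2m} ##, whereas counting collisions per key and averaging over the keys (as in the attempt) would instead hover around ## \frac{n-1}{m} ##.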

Question:
Obviously my answer is not correct since ## \frac{n-1}{m} \neq \frac{n(n-1)}{2m} ##. I understand the correct answer and agree with it, but I don't understand why my line of reasoning in the attempt didn't yield it. What was wrong in my attempt that led to the incorrect result?

Thanks in advance for any help!
 
CGandC said:
What was wrong in my attempt that led to the incorrect result?

You disappeared off into the woods after

CGandC said:
Let ## X_{ij} ## be an indicator random variable for the event that a collision between ## k_i ## and ## k_j ## occurred, i.e. indicating whether ## h(k_i) = h(k_j) ## or not.

Clearly ## P(X_{ij} = 1) = \dfrac 1 m ##, so all we have to do is count the number of ## X_{ij} ##s and remember not to count the same collision twice.
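
Spelling that hint out (a short worked step in the thread's notation, writing ## C_i ## here for the number of keys that collide with ## k_i ##, i.e. the quantity the attempt computes as ## C(X) \mid X = i ##): each colliding pair ## \{ k_i, k_j \} ## contributes once to ## C_i ## and once to ## C_j ##, so summing over all keys counts every collision exactly twice,
$$ \sum_{i=1}^{n} C_i = 2Y \quad\Longrightarrow\quad \mathbb{E}[Y] = \frac{1}{2} \sum_{i=1}^{n} \mathbb{E}[ C_i ] = \frac{1}{2} \cdot n \cdot \frac{n-1}{m} = \frac{n(n-1)}{2m}. $$
Averaging the ## C_i ## over ## i ## instead (the extra ## \frac{1}{n} ## factor in the attempt) gives ## \frac{n-1}{m} ##, the expected number of collisions involving a single (randomly chosen) key, not the total number of colliding pairs.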
 