What is the Probability of Sampling a Set with a Certain Size?

yahastu · Mar 31, 2021

I've come up with a problem that's important for an algorithm I'm developing, and I don't know how to solve it. Wondering if anyone here can help?

I have an initial set S1 of size N.

S2 is created by randomly sampling M samples from S1 with replacement (meaning the same item could be selected multiple times).

Let C = |S2| be the size of S2 (ie, the number of unique elements that get sampled).

For what value of X can we say that C > X with 95% probability?

It would be easy to solve this problem through numerical simulation, but I need a fast analytical solution (or approximate).

Any ideas?

Office_Shredder · Mar 31, 2021

Good approximate solutions probably require knowing what M is relative to N. If M<<N then M is going to be your best guess (you'll never grab the same item twice), if ##M\approx N/2## you're going to get a very different answer.

yahastu · Mar 31, 2021

Office_Shredder said:

Good approximate solutions probably require knowing what M is relative to N. If M<<N then M is going to be your best guess (you'll never grab the same item twice), if ##M\approx N/2## you're going to get a very different answer.

I did a numerical simulation, and found that the relationship between M/N and any given percentile of the distribution of C is linear. Using a linear regression I computed that the solution to X above is approx given by:

X = M( 0.922954 - 0.3144(M/N) )

So this is cool, can't beat it in terms of efficiency, although I have absolutely no clue how one would go about deriving this relationship analytically!

Office_Shredder · Mar 31, 2021

Let ##e_m## be the expected number of unique hits after ##m## drops. Then roughly, ##e_{m+1}\approx e_m+(1-e_m/N)##. And of course ##e_1=1## so you could solve that recurrence relation numerically and see how it compares...

Stephen Tashi · Apr 6, 2021

Is the relevant random variable ##Y##, a sum of independent geometric random variables?

Let ##X_i## be the geometric random variable defined as the number of trials, up to an including the trial that attains the first "success" in a sequence of independent trials where the probability of success on each trial is ##\frac{N-i}{N}##

Define ##Y = \sum_{i=0}^M X_i ## Does a result ##Y = k## correspond to getting ##M## distinct samples after taking ##k## samples from a population of size ##N## using random sampling with replacement?

Stephen Tashi · Apr 20, 2021

Office_Shredder said:

so you could solve that recurrence relation numerically and see how it compares...

Recursion is a good idea. We can apply it to recursively generate the relevant distributions.

Let ##N## be the population size, and let ##M## the number of samples taken. We can imagine the samples taken taken in a sequence. Let ##f_M()## be the probability distribution for the number of distinct populations members in the sample after M samples are taken. So ## f_M(k)## is the probability that after ##M## samples are taken that exactly ##k## distinct members of the population are in the sample. Let ##F_M(k)## be the cumulative distribution for ##f##.

Consider computing ##F_{M+1}(k+1)##. In order to have ##k+1## or fewer samples after the ##M+1## sample is taken we must be in one of the following mutually exclusive cases after the ##M##th sample was taken.

1) In the previous ##M## samples there are ##k## or fewer distinct members in the sample.

2) In the previous ##M## samples there are exactly ##k+1## distinct members in the sample

So we have the recursion relation

##F_{M+1}(k+1) = F_M(k) + (f_M(k+1))(\frac{k+1}{N})\ \ ## [eq. 1]

which is based on computing the probability of transitioning between each of the above cases to the situation of having ##k+1## or fewer distinct members after the ##M+1## sample is taken.We have the initial conditions
##f_1(1) = 1, F_1(1) = 1##
##F(k) =1 ## for ##k \ge M##
##f_M(k) = 0## for ##k > M## or ##k < 1##Examples:

Applying eq. 1 with ##M = 1## we have:

##F_2(1) = F_1(0) + f_1(1) \frac{1}{N} = 0 + (1)(\frac{1}{N}) = \frac{1}{N}##

##F_2(2) = F_1(1) + f_1(2)\frac{2}{N} = 1 + 0 = 1##

which implies ##f_2(1) = \frac{1}{N},\ f_2(2) = 1 - \frac{1}{N} = \frac{N-1}{N} ##

For ##M = 2## we have:

##F_3(1) = F_2(0) + (f_2(1))(\frac{1}{N}) = 0 + (\frac{1}{N})(\frac{1}{N}) = \frac{1}{N^2}##

##F_3(2) = F_2(1) + (f_2(2))(\frac{2}{N}) = \frac{1}{N} + (\frac{N-1}{N})(\frac{2}{N}) = \frac{3N-2}{N^2}##

##F_3(3) = F_2(2) + f_2(3)(\frac{3}{N}) = 1 + (0)(\frac{3}{N}) = 1 ##

which implies

##f_3(1) = \frac{1}{N^2}##
## f_3(2) = \frac{3N-2}{N^2} - \frac{1}{N^2} = \frac{2N-2}{N^2}##
## f_3(3) = 1 - \frac{3N-2}{N^2} = \frac{N^2 - 3N + 2}{N^2}##

I don't know if pursuing the symbolic algebra implied by the recursion works out in a nice way. At least the recursion implies a numerical algorithm for generating ##F_M##.

Office_Shredder · Apr 20, 2021

Anyway, assuming each step is small, and changing e to f for reasons that will be obvious, so f(m) is the number of buckets filled after m drops

##f_{m+1}-f_m \approx 1-f_m/N##

Turns into a separable differential equation

##df/dm = 1-f/N##.

This is separable,
##df/(1-f/N) = dm##
So,
##-ln(|1-f/N|)= m+C##
For some constant C, which turns into

##1-f/N =Be^{-m}##
For some constant B. Solving for f yields
##f=N-NBe^{-m}##.

We know f(0)=0 since we haven't dropped anything yet so ##B=1##. Then ##f(N) = N-N/e##. 1-1/e is about 0.632 which matches pretty closely the linear regression value of 0.922954 - 0.3144 from the post above.

What is the Probability of Sampling a Set with a Certain Size?

1. What is the definition of probability in the context of sampling a set?

2. How is the probability of sampling a set with a certain size calculated?

3. Does the size of the population affect the probability of sampling a set with a certain size?

4. How does the number of elements in the desired set affect the probability of sampling a set with a certain size?

5. Can the probability of sampling a set with a certain size be greater than 1?

Similar threads

Hot Threads

Recent Insights