What is the Probability of Sampling a Set with a Certain Size?

  • Context: Graduate 
  • Thread starter Thread starter yahastu
  • Start date Start date
  • Tags Tags
    Probability Set
Click For Summary

Discussion Overview

The discussion revolves around calculating the probability of obtaining a certain number of unique samples (C) when randomly sampling M items from an initial set (S1) of size N with replacement. Participants explore both analytical and numerical approaches to derive a solution for the value of X such that C > X with 95% probability, addressing the implications of different ratios of M to N.

Discussion Character

  • Exploratory, Technical explanation, Mathematical reasoning, Debate/contested

Main Points Raised

  • One participant poses a problem regarding the probability of unique samples when sampling with replacement, seeking a fast analytical solution.
  • Another participant suggests that the relationship between M and N is crucial, indicating that if M is much smaller than N, the expected number of unique samples will be close to M, while if M is approximately N/2, the outcome will differ significantly.
  • A numerical simulation is mentioned, revealing a linear relationship between M/N and percentiles of the distribution of C, leading to an approximate formula for X.
  • One participant introduces the expected number of unique hits after m samples, proposing a recurrence relation to compute this value numerically.
  • Another participant questions whether the relevant random variable can be modeled as a sum of independent geometric random variables, defining a new variable Y based on this concept.
  • A detailed recursive approach is presented to compute the probability distribution for the number of distinct members in the sample after M samples, including initial conditions and examples of applying the recursion.
  • A participant discusses a differential equation approach to model the number of distinct samples, deriving a relationship that aligns with earlier numerical findings.

Areas of Agreement / Disagreement

Participants express various approaches and insights, but there is no consensus on a definitive analytical solution. Multiple competing views and methods remain present throughout the discussion.

Contextual Notes

The discussion involves complex mathematical relationships and assumptions regarding the sampling process, which may not be fully resolved or universally applicable across different scenarios.

yahastu
Messages
79
Reaction score
7
I've come up with a problem that's important for an algorithm I'm developing, and I don't know how to solve it. Wondering if anyone here can help?

I have an initial set S1 of size N.

S2 is created by randomly sampling M samples from S1 with replacement (meaning the same item could be selected multiple times).

Let C = |S2| be the size of S2 (ie, the number of unique elements that get sampled).

For what value of X can we say that C > X with 95% probability?

It would be easy to solve this problem through numerical simulation, but I need a fast analytical solution (or approximate).

Any ideas?
 
Physics news on Phys.org
Good approximate solutions probably require knowing what M is relative to N. If M<<N then M is going to be your best guess (you'll never grab the same item twice), if ##M\approx N/2## you're going to get a very different answer.
 
Office_Shredder said:
Good approximate solutions probably require knowing what M is relative to N. If M<<N then M is going to be your best guess (you'll never grab the same item twice), if ##M\approx N/2## you're going to get a very different answer.

I did a numerical simulation, and found that the relationship between M/N and any given percentile of the distribution of C is linear. Using a linear regression I computed that the solution to X above is approx given by:

X = M( 0.922954 - 0.3144(M/N) )

So this is cool, can't beat it in terms of efficiency, although I have absolutely no clue how one would go about deriving this relationship analytically!
 
Let ##e_m## be the expected number of unique hits after ##m## drops. Then roughly, ##e_{m+1}\approx e_m+(1-e_m/N)##. And of course ##e_1=1## so you could solve that recurrence relation numerically and see how it compares...
 
  • Like
Likes   Reactions: Stephen Tashi
Is the relevant random variable ##Y##, a sum of independent geometric random variables?

Let ##X_i## be the geometric random variable defined as the number of trials, up to an including the trial that attains the first "success" in a sequence of independent trials where the probability of success on each trial is ##\frac{N-i}{N}##

Define ##Y = \sum_{i=0}^M X_i ## Does a result ##Y = k## correspond to getting ##M## distinct samples after taking ##k## samples from a population of size ##N## using random sampling with replacement?
 
Office_Shredder said:
so you could solve that recurrence relation numerically and see how it compares...

Recursion is a good idea. We can apply it to recursively generate the relevant distributions.

Let ##N## be the population size, and let ##M## the number of samples taken. We can imagine the samples taken taken in a sequence. Let ##f_M()## be the probability distribution for the number of distinct populations members in the sample after M samples are taken. So ## f_M(k)## is the probability that after ##M## samples are taken that exactly ##k## distinct members of the population are in the sample. Let ##F_M(k)## be the cumulative distribution for ##f##.

Consider computing ##F_{M+1}(k+1)##. In order to have ##k+1## or fewer samples after the ##M+1## sample is taken we must be in one of the following mutually exclusive cases after the ##M##th sample was taken.

1) In the previous ##M## samples there are ##k## or fewer distinct members in the sample.

2) In the previous ##M## samples there are exactly ##k+1## distinct members in the sample

So we have the recursion relation

##F_{M+1}(k+1) = F_M(k) + (f_M(k+1))(\frac{k+1}{N})\ \ ## [eq. 1]

which is based on computing the probability of transitioning between each of the above cases to the situation of having ##k+1## or fewer distinct members after the ##M+1## sample is taken.We have the initial conditions
##f_1(1) = 1, F_1(1) = 1##
##F(k) =1 ## for ##k \ge M##
##f_M(k) = 0## for ##k > M## or ##k < 1##Examples:

Applying eq. 1 with ##M = 1## we have:

##F_2(1) = F_1(0) + f_1(1) \frac{1}{N} = 0 + (1)(\frac{1}{N}) = \frac{1}{N}##

##F_2(2) = F_1(1) + f_1(2)\frac{2}{N} = 1 + 0 = 1##

which implies ##f_2(1) = \frac{1}{N},\ f_2(2) = 1 - \frac{1}{N} = \frac{N-1}{N} ##

For ##M = 2## we have:

##F_3(1) = F_2(0) + (f_2(1))(\frac{1}{N}) = 0 + (\frac{1}{N})(\frac{1}{N}) = \frac{1}{N^2}##

##F_3(2) = F_2(1) + (f_2(2))(\frac{2}{N}) = \frac{1}{N} + (\frac{N-1}{N})(\frac{2}{N}) = \frac{3N-2}{N^2}##

##F_3(3) = F_2(2) + f_2(3)(\frac{3}{N}) = 1 + (0)(\frac{3}{N}) = 1 ##

which implies

##f_3(1) = \frac{1}{N^2}##
## f_3(2) = \frac{3N-2}{N^2} - \frac{1}{N^2} = \frac{2N-2}{N^2}##
## f_3(3) = 1 - \frac{3N-2}{N^2} = \frac{N^2 - 3N + 2}{N^2}##

I don't know if pursuing the symbolic algebra implied by the recursion works out in a nice way. At least the recursion implies a numerical algorithm for generating ##F_M##.
 
Anyway, assuming each step is small, and changing e to f for reasons that will be obvious, so f(m) is the number of buckets filled after m drops

##f_{m+1}-f_m \approx 1-f_m/N##

Turns into a separable differential equation

##df/dm = 1-f/N##.

This is separable,
##df/(1-f/N) = dm##
So,
##-ln(|1-f/N|)= m+C##
For some constant C, which turns into

##1-f/N =Be^{-m}##
For some constant B. Solving for f yields
##f=N-NBe^{-m}##.

We know f(0)=0 since we haven't dropped anything yet so ##B=1##. Then ##f(N) = N-N/e##. 1-1/e is about 0.632 which matches pretty closely the linear regression value of 0.922954 - 0.3144 from the post above.
 

Similar threads

  • · Replies 4 ·
Replies
4
Views
1K
  • · Replies 2 ·
Replies
2
Views
2K
  • · Replies 12 ·
Replies
12
Views
3K
  • · Replies 7 ·
Replies
7
Views
3K
  • · Replies 4 ·
Replies
4
Views
3K
  • · Replies 1 ·
Replies
1
Views
2K
Replies
1
Views
2K
  • · Replies 3 ·
Replies
3
Views
1K
  • · Replies 2 ·
Replies
2
Views
2K
  • · Replies 4 ·
Replies
4
Views
2K