Probability of Random Numbers

1. Jun 24, 2017

sean_s

1. The problem statement, all variables and given/known data
• Let 0≤p≤1.
• Let there be k distinct numbers (they can be natural numbers) a1, a2, ... , ak, each repeating respectively b1, b2, ... , bk times.
• Let q < ∑r=1k br
Determine the minimal values of b1 ... bk such that the probability of q numbers chosen out of ∑r=1k br numbers being exactly the same is p.

Example : consider 1,1,1,1,2 . The probability that 2 numbers chosen out of the 5 is exactly the same is 6/10

2. Relevant equations

The only ones I can come up with is
• h = ∑r=1k br
and
• p = ∑r=1k qCb_r where q < br / qCh
C stands for Combination , in contrast to permutation.

3. The attempt at a solution

Technically this is not a homework assignment, but needed for a project where a software needs to generate pseudo-random numbers, with some properties.

I notice that we have too many variables and just two equations. But a minimal solution should be possible. I personally am stuck, but it looks like I am missing / forgetting some basic properties of probability - with which this problem can be easily solved.. I could apply brute force. The example was generated via brute force. But I am sure a more formal solution is available.

This is why I classify this as precalculus homework..

Please help. My background is Physics MSc - so don't hesitate to apply more advanced notation on the solution than precalculus, if needed,.

2. Jun 24, 2017

FactChecker

It seems as though the probability will always be a rational number. If p is irrational, the probability can never equal p. If you assume that p is rational, then maybe you can use its rational representation, p = m/n, to figure something out.

3. Jun 24, 2017

Ray Vickson

For sampling without replacement (i.e., once a number is picked it becomes unavailable to be picked again) you need to have at least one of the $b_i \geq q$ in order to have any chance that you can get $q$ numbers the same. For sampling with replacement, this need not be the case, because with replacement you "put back" a number after it has been selected.

I get 6/10 in your 11112 example, but only if the sampling is without replacement.

There is a way of performing your calculations, but it is very computer-intensive. Let's say that all your $b_i$ are $\geq q$. It was not stated clearly how many numbers you are choosing---you only said you wanted $q$ the same, so the number $n$ picked must be at least $q$. So, we pick $n$ numbers without replacement. For each $i = 1,2, \ldots, k$, let $E_i$ be the event that number $i$ is repeated exactly $q$ times in your sample of size $n$. You want to know something like the probability that at least one of the events $E_1, E_2, \ldots, E_k$ occurs. However, this is still not sufficiently well-defined. Do you want just one of $E_1, E_2, \ldots, E_k$ to occur (so if $q = 3$ and you have 111, you cannot have 222 also, etc.), or do you allow several occurrences of $q$ identical numbers (so it would be OK to have both 111 and 222, etc.)? Each of these two cases can be treated using versions of the so-called inclusion-exclusion principle, but the final details are different in the two versions.

Let
$$\begin{array}{rcl} S_1&=& \sum_{i=1}^j P(E_i) \\ S_2 &=& \sum_{i,j: \: i < j} P(E_i E_j) \\ S_3 &=&\sum_{i,j,l: \: i < j < l} P(E_i E_j E_l) \\ &\vdots& \\ S_k&=& P(E_1 E_2 \cdots E_n) \end{array}$$
(Here, $P(AB)$ means $P(A\: \text{and}\: B)$, etc.)
For example, if $k=4$ we have
$$\begin{array}{rcl} S_1 &=& P(E_1) + P(E_2) + P(E_3) + P(E_4)\\ S_2&=&P(E_1 E_2) + P(E_1 E_3) + P(E_1 E_4) + P(E_2 E_3) + \cdots + P(E_3 E_4) \\ S_3 &=& P(E_1 E_2 E_3) + P(E_1 E_2 E_4) + \cdots + P(E_2 E_3 E_4)\\ S_4 &=& P(E_1 E_2 E_3 E_4) \end{array}$$

If we let $E_{{}\geq 1}$ denote the event that at least one of the $E_i$ occur and $E_{=1}$ the event that exactly one of the $E_i$ occur, we have:
$$P(E_{\geq 1}) = S_1 - S_2 + S_3 - \cdots \pm S_k$$
and
$$P(E_{=1}) = S_1 - 2 S_2 + 3 S_3 - \cdots \pm k S_k$$

OK, so how do we compute the $S_1, S_2, \ldots, S_k$?

To compute $P(E_1)$ we categorize the outcomes as "1" and "not 1". The probability of getting exactly $q$ 1s in $n$ picks is given by the hypergeometric probability
$$P(E_1) = \frac{C(b_1,q) C(B_1,n-q)}{C(N,n)}$$
Here, $N = b_1 + b_2 + \cdots + b_k$, $B_i = N - b_i$ for each $i$, and $C(a,b)$ is the binomial coefficient ${}^aC_b$. Similarly for the other $P(E_i)$, so we have
$$S_1 = \sum_{i=1}^k \frac{C(b_i,q)C(B_i,n-q)}{C(N,n)}$$
To compute $P(E_1 E_2)$, categorize the outcomes into three classes, "1", "2" and "neither 1 nor 2", which contain $b_1, b_2$ and $B_{12} = N - b_1 - b_2$ members. Now we have a 3-class hypergeometric distribution, which is less familiar than the classic hypergeometric, but is still easy enough to get (from conditional probabilities, for example):
$$P(E_1 E_2 E_3) = \frac{C(b_1,q) C(b_2,q) C(B_{12}, n - 2q)}{C(N,n)},$$
etc.
So, in principle we can compute $S_2$.

In a similar way we have
$$P(E_i E_j E_l) = \frac{C(b_i,q) C(b_j,q) C(b_l,q) C(N-b_i-b_k-b_l, n-3q)}{C(N,n)},$$
etc.

Of course, in any practical example, some of the $P(E_i E_j E_l \cdots E_r)$ may be zero, just because $qr > n$ for example.

So, it is all computable in principle, but it will not be pleasant. Best of luck.

Last edited: Jun 24, 2017