sean_s said:
Homework Statement
- Let 0≤p≤1.
- Let there be k distinct numbers (they can be natural numbers) a1, a2, ... , ak, each repeating respectively b1, b2, ... , bk times.
- Let q < ∑r=1k br
Determine the minimal values of b
1 ... b
k such that the probability of q numbers chosen out of ∑
r=1k b
r numbers being exactly the same is p.
Example : consider 1,1,1,1,2 . The probability that 2 numbers chosen out of the 5 is exactly the same is 6/10
Homework Equations
The only ones I can come up with is
and
- p = ∑r=1k qCb_r where q < br / qCh
C stands for Combination , in contrast to permutation.
The Attempt at a Solution
[/B]
Technically this is not a homework assignment, but needed for a project where a software needs to generate pseudo-random numbers, with some properties.
I notice that we have too many variables and just two equations. But a minimal solution should be possible. I personally am stuck, but it looks like I am missing / forgetting some basic properties of probability - with which this problem can be easily solved.. I could apply brute force. The example was generated via brute force. But I am sure a more formal solution is available.
This is why I classify this as precalculus homework..
Please help. My background is Physics MSc - so don't hesitate to apply more advanced notation on the solution than precalculus, if needed,.
For sampling without replacement (i.e., once a number is picked it becomes unavailable to be picked again) you need to have at least one of the ##b_i \geq q## in order to have any chance that you can get ##q## numbers the same. For sampling with replacement, this need not be the case, because with replacement you "put back" a number after it has been selected.
I get 6/10 in your 11112 example, but only if the sampling is without replacement.
There is a way of performing your calculations, but it is very computer-intensive. Let's say that
all your ##b_i## are ##\geq q##. It was not stated clearly how many numbers you are choosing---you only said you wanted ##q## the same, so the number ##n## picked must be at least ##q##. So, we pick ##n## numbers
without replacement. For each ##i = 1,2, \ldots, k##, let ##E_i## be the event that number ##i## is repeated exactly ##q## times in your sample of size ##n##. You want to know something like the probability that at least one of the events ##E_1, E_2, \ldots, E_k## occurs. However, this is still not sufficiently well-defined. Do you want just
one of ##E_1, E_2, \ldots, E_k## to occur (so if ##q = 3## and you have 111, you cannot have 222 also, etc.), or do you allow several occurrences of ##q## identical numbers (so it would be OK to have both 111 and 222, etc.)? Each of these two cases can be treated using versions of the so-called inclusion-exclusion principle, but the final details are different in the two versions.
Let
$$\begin{array}{rcl}
S_1&=& \sum_{i=1}^j P(E_i) \\
S_2 &=& \sum_{i,j: \: i < j} P(E_i E_j) \\
S_3 &=&\sum_{i,j,l: \: i < j < l} P(E_i E_j E_l) \\
&\vdots& \\
S_k&=& P(E_1 E_2 \cdots E_n)
\end{array}
$$
(Here, ##P(AB)## means ##P(A\: \text{and}\: B)##, etc.)
For example, if ##k=4## we have
$$\begin{array}{rcl}
S_1 &=& P(E_1) + P(E_2) + P(E_3) + P(E_4)\\
S_2&=&P(E_1 E_2) + P(E_1 E_3) + P(E_1 E_4) + P(E_2 E_3) + \cdots + P(E_3 E_4) \\
S_3 &=& P(E_1 E_2 E_3) + P(E_1 E_2 E_4) + \cdots + P(E_2 E_3 E_4)\\
S_4 &=& P(E_1 E_2 E_3 E_4)
\end{array}
$$
If we let ##E_{{}\geq 1}## denote the event that at least one of the ##E_i## occur and ##E_{=1}## the event that exactly one of the ##E_i## occur, we have:
$$P(E_{\geq 1}) = S_1 - S_2 + S_3 - \cdots \pm S_k $$
and
$$P(E_{=1}) = S_1 - 2 S_2 + 3 S_3 - \cdots \pm k S_k$$
OK, so how do we compute the ##S_1, S_2, \ldots, S_k##?
To compute ##P(E_1)## we categorize the outcomes as "1" and "not 1". The probability of getting exactly ##q## 1s in ##n## picks is given by the hypergeometric probability
$$P(E_1) = \frac{C(b_1,q) C(B_1,n-q)}{C(N,n)}$$
Here, ##N = b_1 + b_2 + \cdots + b_k##, ##B_i = N - b_i## for each ##i##, and ##C(a,b)## is the binomial coefficient ##{}^aC_b##. Similarly for the other ##P(E_i)##, so we have
$$S_1 = \sum_{i=1}^k \frac{C(b_i,q)C(B_i,n-q)}{C(N,n)} $$
To compute ##P(E_1 E_2)##, categorize the outcomes into three classes, "1", "2" and "neither 1 nor 2", which contain ##b_1, b_2## and ##B_{12} = N - b_1 - b_2## members. Now we have a 3-class hypergeometric distribution, which is less familiar than the classic hypergeometric, but is still easy enough to get (from conditional probabilities, for example):
$$P(E_1 E_2 E_3) = \frac{C(b_1,q) C(b_2,q) C(B_{12}, n - 2q)}{C(N,n)},$$
etc.
So, in principle we can compute ##S_2##.
In a similar way we have
$$P(E_i E_j E_l) = \frac{C(b_i,q) C(b_j,q) C(b_l,q) C(N-b_i-b_k-b_l, n-3q)}{C(N,n)}, $$
etc.
Of course, in any practical example, some of the ##P(E_i E_j E_l \cdots E_r)## may be zero, just because ##qr > n## for example.
So, it is all computable in principle, but it will not be pleasant. Best of luck.