Probability calculation involving very large numbers

Discussion Overview

The discussion revolves around calculating probabilities related to a sampling problem involving a large population of 200,000,000 individuals, focusing on a specific subset of 60,000 individuals. Participants explore how to compute the probabilities of selecting certain numbers of these individuals from a random sample of 10,000,000, addressing challenges posed by the large numbers involved.

Discussion Character

  • Exploratory
  • Mathematical reasoning
  • Debate/contested

Main Points Raised

  • One participant, Matt, outlines a problem involving the calculation of probabilities for selecting red and white marbles from a large population, expressing uncertainty about how to approach the calculations for large numbers.
  • Another participant suggests starting with the probability of selecting one marble and building up to larger samples, questioning how the probabilities change with increasing sample sizes.
  • Some participants propose using binomial coefficients and the Hypergeometric distribution to express the probabilities, defining variables for clarity.
  • One participant provides specific calculations based on the Hypergeometric distribution: the probability of selecting no red marbles is less than $10^{-1336}$, the probability of selecting at least 40 red marbles is greater than $1 - 40 \cdot 10^{-1246}$, and the probability of selecting all 60,000 red marbles is less than $10^{-78136}$.
  • There is mention of using logarithmic transformations to handle the large numbers involved in the calculations, suggesting a method to evaluate probabilities without direct computation of very small values.
  • Alternative approaches, such as approximating the Hypergeometric distribution with a Normal distribution, are mentioned, but concerns about the accuracy of such approximations are raised.

Areas of Agreement / Disagreement

Participants express varying levels of confidence in their approaches to the problem, with some agreeing on the use of binomial coefficients while others question the feasibility of calculating probabilities directly due to the size of the numbers. The discussion remains unresolved regarding the best method to approach the calculations and the implications of the results.

Contextual Notes

Limitations include the challenges of calculating probabilities with large numbers, the dependence on definitions of terms like "favorable outcomes," and the unresolved nature of approximations versus exact calculations.

Matt2
Hi, I'm trying to figure out how to compute probability related to a problem I am tackling for work, and I think I have a handle on how to do it with smaller numbers, but no idea how to approach it for larger numbers. (And I need to explain the answers to a judge in simple terms). So here is the problem:

Imagine a company that maintains data about 200,000,000 Americans. Each month, this company takes a completely random sample of 5% of these reports to analyze. However, we are concerned only with 60,000 specific people out of this group of 200,000,000.

So we can visualize this as 60,000 "red marbles" and 199,940,000 "white marbles". Assuming that these combined 200,000,000 marbles are placed into a very large container and that 10,000,000 are selected randomly. I am trying to calculate:

1) The chance that none of the 10,000,000 marbles will be red;

2) The chance that 40 or more of the 10,000,000 marbles will be red;

3) The chance that all of the possible 60,000 red marbles will be included in the 10,000,000 selected.

Of these, the answer to question #2 is the most important, followed by #1, then #3.

Does anyone have an idea on where to start? I thought maybe it would make it easier to simply remove 4 zeroes from each number so that we are working with 20,000 / 6 / 1,000 but it seems that this skews the results. Would appreciate a pointer in the right direction.

Thanks!
Matt
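
A small check of why removing four zeroes skews the results. This is a minimal sketch, assuming Python 3.8+ for `math.comb`; the helper name `p_no_red` is made up for illustration. Scaling all three numbers keeps the proportions the same, but the expected number of red marbles in the sample drops from 3,000 to 0.3, so the chance of drawing no red marble in the scaled-down problem comes out at about 0.735.

```python
import math
from fractions import Fraction

def p_no_red(total, red, sample):
    """Exact chance that a random sample contains no red marbles."""
    # Samples made of white marbles only, divided by all possible samples.
    return Fraction(math.comb(total - red, sample), math.comb(total, sample))

# Scaled-down version of the problem: 20,000 marbles, 6 red, sample of 1,000.
scaled = float(p_no_red(20_000, 6, 1_000))
print(f"scaled-down P(no red) = {scaled:.4f}")

# Expected number of red marbles in the sample, before and after scaling.
expected_full = 10_000_000 * 60_000 / 200_000_000
expected_scaled = 1_000 * 6 / 20_000
print(expected_full, expected_scaled)
```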
 
Matt said:
So we can visualize this as 60,000 "red marbles" and 199,940,000 "white marbles". Assuming that these combined 200,000,000 marbles are placed into a very large container and that 10,000,000 are selected randomly. I am trying to calculate:

1) The chance that none of the 10,000,000 marbles will be red;

Hi Matt! Welcome to MHB! :)

Suppose you pick 1 marble.
What is the chance that it won't be red?
What if you pick 2 marbles?
Or 10?
Or 10,000,000?
 
I like Serena said:
Hi Matt! Welcome to MHB! :)

Suppose you pick 1 marble.
What is the chance that it won't be red?
What if you pick 2 marbles?
Or 10?
Or 10,000,000?

The chance of 1 marble being white is 199,940,000 / 200,000,000.
The chance of the 2nd marble being white would seem to be 199,939,999 / 199,999,999.

No idea after that. I became a lawyer because I wasn't good at math. :) And have tried a couple online math tutors who have told me that the answer to my questions cannot be calculated given the size of the numbers. Appreciate any help!
 
Matt said:
The chance of 1 marble being white is 199,940,000 / 200,000,000.
The chance of the 2nd marble being white would seem to be 199,939,999 / 199,999,999.

No idea after that. I became a lawyer because I wasn't good at math. :) And have tried a couple online math tutors who have told me that the answer to my questions cannot be calculated given the size of the numbers. Appreciate any help!

Let's give those numbers a name.
Let's define $N=200,000,000$, $n=199,940,000$, and $k=10,000,000$.

So the chance of 1 marble being white is:
$$P(1\text{ white}) = \frac n N$$
For 2 marbles we get:
$$P(2\text{ white}) = \frac n N \frac{n-1}{N-1} = \frac {n(n-1)}{N(N-1)}$$
For $k$ marbles we get:
$$P(k\text{ white}) = \underbrace{\frac {n(n-1)\cdots(n-k+1)}{N(N-1)\cdots(N-k+1)}}_{k\text{ factors}}$$

Alternatively, we can use the general formula:
$$P = \frac{\text{Number of favorable outcomes}}{\text{Total number of outcomes}}$$

The number of favorable outcomes is the number of ways we can choose $k$ white marbles from the $n$ white marbles.
This is $n \choose k$.

The total number of outcomes is the number of ways we can choose $k$ marbles from the total of $N$ marbles.
This is $N \choose k$.

So:
$$P(k\text{ white}) = \frac{n \choose k}{N \choose k}$$
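
The two forms above can be checked against each other on numbers small enough to compute exactly. This is a minimal sketch, assuming Python 3.8+ for `math.comb`; the toy values (20 marbles, 14 white, 5 drawn) and the function names are made up for illustration.

```python
import math
from fractions import Fraction

def p_all_white_product(N, n, k):
    """Multiply the k factors n/N * (n-1)/(N-1) * ... one draw at a time."""
    p = Fraction(1)
    for i in range(k):
        p *= Fraction(n - i, N - i)
    return p

def p_all_white_binomial(N, n, k):
    """Favorable outcomes over total outcomes: C(n, k) / C(N, k)."""
    return Fraction(math.comb(n, k), math.comb(N, k))

# Toy numbers (assumed for illustration): 20 marbles, 14 white, draw 5.
N, n, k = 20, 14, 5
by_product = p_all_white_product(N, n, k)
by_binomial = p_all_white_binomial(N, n, k)
print(by_product, by_binomial)
```

Both functions return the same exact fraction, which is the point of the "alternatively" step: the draw-by-draw product and the ratio of binomial coefficients are the same quantity.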
 
Matt said:
Hi, I'm trying to figure out how to compute probability related to a problem I am tackling for work, and I think I have a handle on how to do it with smaller numbers, but no idea how to approach it for larger numbers. (And I need to explain the answers to a judge in simple terms). So here is the problem:

Imagine a company that maintains data about 200,000,000 Americans. Each month, this company takes a completely random sample of 5% of these reports to analyze. However, we are concerned only with 60,000 specific people out of this group of 200,000,000.

So we can visualize this as 60,000 "red marbles" and 199,940,000 "white marbles". Assuming that these combined 200,000,000 marbles are placed into a very large container and that 10,000,000 are selected randomly. I am trying to calculate:

1) The chance that none of the 10,000,000 marbles will be red;

2) The chance that 40 or more of the 10,000,000 marbles will be red;

3) The chance that all of the possible 60,000 red marbles will be included in the 10,000,000 selected.

[snip]
The answers to your questions are
1) Effectively zero (less than 10^-1336)
2) Effectively one (more than 1 - 40 * 10^-1246)
3) Effectively zero (less than 10^-78136)

Let's start by assigning names to some of your numbers:
N = 200,000,000
K = 60,000
n = 10,000,000
If p(x) is the probability that the sample of n marbles contains exactly x red marbles, then
$$p(x) = \frac{\binom{K}{x} \binom{N-K}{n-x}}{\binom{N}{n}}$$
where $\binom{j}{k} = \frac{j!}{k! \; (j-k)!}$ is the number of ways to choose k items out of j, also known as a "binomial coefficient". See the Wikipedia article on the Hypergeometric distribution.

The trouble is, as you have already pointed out, it's not practical to calculate p(x) in this form with numbers as large as you have given, so we must resort to some tricks. We will calculate the logarithm of p(x) instead of calculating p(x) directly. This lets us work with numbers of manageable size in the intermediate calculations, so they can be done, for example, in double precision floating point, or in Excel, which uses double precision. So the first trick is to take the natural logarithm (base $e \approx 2.71828$) of the equation for p(x):
$$\ln(p(x)) = \ln\binom{K}{x} + \ln \binom{N-K}{n-x} - \ln\binom{N}{n}$$
The second trick is to use a mathematical function available in Excel for the evaluation of the logarithms of the binomial coefficients. We have
$$\ln \binom{j}{k} = \ln \left( \frac{j!}{k! \; (j-k)!} \right) = \ln(j!) - \ln(k!) - \ln((j-k)!)$$
so we need a convenient way to evaluate $\ln(t!)$ for large values of $t$. Fortunately, Excel provides a function GAMMALN, defined by $\text{GAMMALN}(t) = \ln(\Gamma(t))$, where $\Gamma(t)$ is the Gamma function (see the Wikipedia article on the Gamma function). Since $t! = \Gamma(t+1)$ for a non-negative integer $t$, we have $\text{GAMMALN}(t+1) = \ln(t!)$.
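
The same identity can be sketched outside Excel. The following assumes Python, whose standard-library `math.lgamma` plays the role of GAMMALN; the helper names `log_factorial` and `log_binom` are made up for illustration.

```python
import math

def log_factorial(t):
    """ln(t!) computed as ln(Gamma(t + 1)), the same identity as GAMMALN(t + 1)."""
    return math.lgamma(t + 1)

def log_binom(j, k):
    """Natural log of the binomial coefficient C(j, k)."""
    return log_factorial(j) - log_factorial(k) - log_factorial(j - k)

# Sanity check on a coefficient small enough to compute exactly.
small_via_lgamma = log_binom(50, 20)
small_exact = math.log(math.comb(50, 20))
print(small_via_lgamma, small_exact)

# A coefficient far too large to write out, handled without overflow.
huge = log_binom(200_000_000, 10_000_000)
print(huge)
```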

If we put this all together and evaluate ln(p(0)) in an Excel spreadsheet, we find $\ln(p(0)) = -3078.07$, so $$\log_{10}(p(0)) = \frac{\ln(p(0))}{\ln(10)} = -1336.79$$ This shows that $p(0) < 10^{-1336}$, the answer to your first question.
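
For readers without Excel, the evaluation of ln(p(0)) can be reproduced roughly as follows. This is a sketch, assuming Python's `math.lgamma` in place of GAMMALN; `log_binom` and `log_p` are illustrative names. The individual log-factorials are of order $10^9$, so in double precision only about six digits after the decimal point survive the subtraction, which is still enough to confirm the two-decimal figures quoted above.

```python
import math

N, K, n = 200_000_000, 60_000, 10_000_000

def log_binom(j, k):
    """Natural log of C(j, k) via ln(t!) = lgamma(t + 1)."""
    return math.lgamma(j + 1) - math.lgamma(k + 1) - math.lgamma(j - k + 1)

def log_p(x):
    """Natural log of the hypergeometric probability of exactly x red marbles."""
    return log_binom(K, x) + log_binom(N - K, n - x) - log_binom(N, n)

ln_p0 = log_p(0)
log10_p0 = ln_p0 / math.log(10)  # change of base from e to 10
print(f"ln p(0) = {ln_p0:.2f}, log10 p(0) = {log10_p0:.2f}")
```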

For question 2), note that the probability that the sample of n marbles will not contain at least 40 red marbles is
$$\sum_{x=0}^{39} p(x)$$
If we go through the same steps as above to evaluate $\log_{10}(p(x))$ for $x = 0, 1, 2, \dots , 39$, we find the largest number in the sequence is $\log_{10}(p(39)) = -1246.62$. This shows $p(x) < 10^{-1246}$ for $x = 0, 1, 2, \dots , 39$, so
$$\sum_{x=0}^{39} p(x) < 40 \cdot 10^{-1246}$$
Since this is the probability that the sample does not contain at least 40 red marbles, the probability that the sample does contain at least 40 red marbles is greater than $1 - 40 \cdot 10^{-1246}$.
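
The 40 terms can be evaluated in a loop. This is a sketch, assuming Python's `math.lgamma` in place of GAMMALN, with illustrative helper names. As an addition to the bound above (not part of the original method), the last lines add the 40 terms using the log-sum-exp trick: the largest term is factored out so that the remaining sum is an ordinary floating point number.

```python
import math

N, K, n = 200_000_000, 60_000, 10_000_000
LN10 = math.log(10)

def log_binom(j, k):
    """Natural log of C(j, k) via ln(t!) = lgamma(t + 1)."""
    return math.lgamma(j + 1) - math.lgamma(k + 1) - math.lgamma(j - k + 1)

def log10_p(x):
    """Base-10 log of the probability of exactly x red marbles in the sample."""
    return (log_binom(K, x) + log_binom(N - K, n - x) - log_binom(N, n)) / LN10

logs = [log10_p(x) for x in range(40)]
largest = max(logs)
print(f"largest term is x = {logs.index(largest)}, log10 = {largest:.2f}")

# Log-sum-exp: log10(sum of p(x)) = largest + log10(sum of 10**(log - largest)).
log10_sum = largest + math.log10(sum(10 ** (v - largest) for v in logs))
print(f"log10 of P(fewer than 40 red) = {log10_sum:.2f}")
```

The terms grow steadily with x over this range, so the largest one is p(39), and the summed value stays below the simpler bound $40 \cdot 10^{-1246}$.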

For question 3), we use the same method to evaluate the logarithm of p(60,000), and we find $\log_{10}(p(60,000)) = -78136.22$, so $p(60,000) < 10 ^{-78136}$, which is a small number indeed.
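
The value for x = 60,000 can be cross-checked in two ways. This is a sketch, assuming Python's `math.lgamma` in place of GAMMALN, with illustrative names. The second calculation uses the fact that $\binom{N-K}{n-K} / \binom{N}{n}$ simplifies to a product of K factors $(n-i)/(N-i)$, whose logarithms can be added directly.

```python
import math

N, K, n = 200_000_000, 60_000, 10_000_000
LN10 = math.log(10)

def log_binom(j, k):
    """Natural log of C(j, k) via ln(t!) = lgamma(t + 1)."""
    return math.lgamma(j + 1) - math.lgamma(k + 1) - math.lgamma(j - k + 1)

# Method 1: the hypergeometric formula with x = K (note that C(K, K) = 1).
via_lgamma = (log_binom(K, K) + log_binom(N - K, n - K) - log_binom(N, n)) / LN10

# Method 2: add the logs of the K factors (n - i) / (N - i), i = 0 .. K-1.
via_sum = math.fsum(math.log((n - i) / (N - i)) for i in range(K)) / LN10

print(f"{via_lgamma:.2f} {via_sum:.2f}")
```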

There are other ways to approach the problem. For example, we could approximate the Hypergeometric distribution with a Normal distribution. But I don't know how to bound the error of the approximation, so that method might be less convincing in court.
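
To give a feel for what the Normal approximation would involve, here is a sketch of its two parameters, using the standard mean and variance of the Hypergeometric distribution (assuming Python; the variable names are illustrative). It shows only how far 40 lies below the expected count. It does not supply the missing error bound, and a Normal tail this far from the mean is generally not considered reliable, which fits the caution above.

```python
import math

N, K, n = 200_000_000, 60_000, 10_000_000

# Mean and variance of the hypergeometric distribution.
mean = n * K / N
variance = n * (K / N) * (1 - K / N) * (N - n) / (N - 1)
std = math.sqrt(variance)

# Distance of the threshold (39.5, with continuity correction) from the mean,
# measured in standard deviations.
z = (39.5 - mean) / std
print(f"mean = {mean:.0f}, std = {std:.1f}, z = {z:.1f}")
```

In plain terms, the sample is expected to contain about 3,000 of the 60,000 people, and a count below 40 would be more than 50 standard deviations below that.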

[edit] I changed some of the variable names, because in some cases I had used the same name for two different purposes in the original post. I hope this version is less confusing.[/edit]
 