Graduate Sample size required in hypergeometric test

The discussion focuses on determining the sample size required in a hypergeometric test to ensure that the fraction of red balls in a total population is below a specified threshold, p_max, with a given certainty level, c. It highlights the complexity of calculating sample size based on the unknown total number of red balls, K, and emphasizes that confidence intervals can be constructed using the hypergeometric distribution. The conversation also touches on the concept of "p-hacking" and the need for clarity in defining certainty and probability in statistical contexts. Additionally, it suggests that understanding the underlying statistical principles is crucial for addressing the problem effectively. The importance of careful formulation of the question is stressed for achieving a mathematical solution.
M_1
I have a hypergeometric distribution with:

N=total population of red and green balls, I know this
K=total number of red balls, I don't know this
n=sample size (number of investigated balls), I can choose this
k=number of investigated balls that are red, I don't know this

Red balls are a problem and I want to make sure, with a certainty c, that in the total population the fraction K/N of red balls is lower than a certain value, called p_max.

How big must n be? And for a given n, what is the maximum value of k for which the total population N can still be approved? (I mean, if I cannot guarantee with certainty c that K/N is below p_max, I have to scrap the entire population N.)

For example, N=1000, c=0.95, and p_max=0.1.

I hope this is graduate level, at least it beats me :-)
 
M_1 said:
How big must n be?
Depends on K.

As an example, in a box with 28 red balls and 72 green balls, it is hard to figure out (with some confidence) if the fraction of red balls is larger than 30%: you'll need to investigate most balls. If the box has 0 red balls and 100 green balls, you can stop after investigating just about 10 balls.
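The "about 10 balls" figure for the all-green box can be checked with a short calculation. This is only a sketch under one assumed reading: "some confidence" means a one-sided test at level 0.05 against the boundary case of 30 red balls in 100. The function names are illustrative and not from the thread.

```python
from math import comb

def prob_no_red(N: int, K: int, n: int) -> float:
    """P(no red ball among n draws without replacement) when K of N balls are red."""
    return comb(N - K, n) / comb(N, n)

def draws_needed_if_all_green(N: int, K_boundary: int, alpha: float = 0.05) -> int:
    """Smallest n for which an all-green sample has probability <= alpha
    if the box actually held K_boundary red balls (assumed boundary case)."""
    for n in range(1, N + 1):
        if prob_no_red(N, K_boundary, n) <= alpha:
            return n
    return N  # fallback: inspect every ball

print(draws_needed_if_all_green(100, 30))  # 9
```

With 8 all-green draws the boundary case still has probability just above 0.05, so 9 draws are needed under these assumptions.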

You can re-evaluate probabilities (for p_max and for the observed fraction) after each drawn ball, but then requiring some given confidence level does not work easily as you do p-hacking.
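A small simulation can illustrate why re-checking after every ball is a problem. The sketch below assumes a box with exactly 30 red balls in 100 (so the claim "fraction below 30%" is false) and a one-sided 5% test at each look; all names, the seed, and the number of trials are made up for illustration.

```python
import random
from math import comb

def cdf(N, K, n, k):
    """P(X <= k) for X ~ Hypergeometric(N, K, n); returns 0.0 for k < 0."""
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k + 1)) / comb(N, n)

def critical_k(N, K0, n, alpha):
    """Largest k with P(X <= k | K0) <= alpha, or -1 if no such k exists."""
    k = -1
    while k + 1 <= n and cdf(N, K0, n, k + 1) <= alpha:
        k += 1
    return k

def simulate(N=100, K=30, n_max=50, alpha=0.05, trials=2000, seed=0):
    """Return (false-claim rate when testing after every draw,
               false-claim rate when testing once, after n_max draws)."""
    rng = random.Random(seed)
    crit = [critical_k(N, K, n, alpha) for n in range(n_max + 1)]
    box = [1] * K + [0] * (N - K)          # 1 = red, 0 = green
    peek_hits = single_hits = 0
    for _ in range(trials):
        sample = rng.sample(box, n_max)    # draws without replacement
        reds = 0
        claimed_early = False
        for n, ball in enumerate(sample, start=1):
            reds += ball
            if reds <= crit[n]:            # "significant" at this look
                claimed_early = True
        peek_hits += claimed_early
        single_hits += reds <= crit[n_max]
    return peek_hits / trials, single_hits / trials

peek_rate, single_rate = simulate()
print(peek_rate, single_rate)
```

Every run that passes the single final test also passes under repeated looks, so the repeated-look rate can never be lower; the single-look test is the one whose error rate is controlled at 5% by construction.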
 
mfb said:
Depends on K.

As an example, in a box with 28 red balls and 72 green balls, it is hard to figure out (with some confidence) if the fraction of red balls is larger than 30%: you'll need to investigate most balls. If the box has 0 red balls and 100 green balls, you can stop after investigating just about 10 balls.

You can re-evaluate probabilities (for p_max and for the observed fraction) after each drawn ball, but then requiring some given confidence level does not work easily as you do p-hacking.
Many thanks!
1) What does "p-hacking" mean?
2) Assume I know K. Could your example with 28/72 and 0/100 balls be expressed so that n is given by a mathematical expression in terms of N, c, p, and maximum k, or similar? Or is there a reference to such an expression?
3) If N=1000 balls in total, I draw 100 balls of which 8 are red. What can I say about K/N? I mean in terms of K/N=0.08 +/- x, with some certainty c, for example 0.95. Almost like my original question but with a two-sided interval. Maybe this can be solved without re-evaluation after each drawn ball.
 
M_1 said:
1) What does "p-hacking" mean?
About 50 million google hits...
It means looking for something falling below/above some arbitrary threshold until you find something.
M_1 said:
2) Assume I know K. Could your example with 28/72 and 0/100 balls be expressed so n is given in a mathematical expression in terms of N, c, p, and maximum k, or similar.
I'm not sure if there is a closed form, but you can certainly analyze every case separately.
M_1 said:
3) If N=1000 balls in total, I draw 100 balls of which 8 are red. What can I say about K/N? I mean in terms of K/N=0.08 +/- x, with some certainty c, for example 0.95.
Construct confidence intervals via the hypergeometric distribution.
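One possible way to do this is to invert two one-sided hypergeometric tests (the same idea as a Clopper-Pearson interval for the binomial). The sketch below is an assumption about what such a construction could look like, not a method given in the thread; the function names are illustrative.

```python
from math import comb

def pmf(N, K, n, k):
    """P(X = k) for X ~ Hypergeometric(N, K, n)."""
    if k < 0 or k > n:
        return 0.0
    # comb() returns 0 when the lower index exceeds the upper one.
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def confidence_interval_K(N, n, k, conf=0.95):
    """Range of K values not rejected by either one-sided test at level (1-conf)/2."""
    alpha = 1.0 - conf
    accepted = []
    # K must be at least k and leave room for the n - k green balls observed.
    for K in range(k, N - (n - k) + 1):
        below_k = sum(pmf(N, K, n, i) for i in range(k))   # P(X <= k-1)
        lower_tail = below_k + pmf(N, K, n, k)             # P(X <= k)
        upper_tail = 1.0 - below_k                         # P(X >= k)
        if lower_tail > alpha / 2 and upper_tail > alpha / 2:
            accepted.append(K)
    return accepted[0], accepted[-1]

lo, hi = confidence_interval_K(N=1000, n=100, k=8)
print(lo / 1000, hi / 1000)   # bounds on K/N
```

The resulting interval is generally not symmetric around k/n, so it does not quite take the form 0.08 +/- x.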
 
M_1 said:
I want to make sure, with a certainty c, that in the total population the fraction K/N of red balls is lower than a certain value, called p_max.

"Certainty" and "probability" are two diametrically opposed concepts.

To have a mathematical problem, you must define what you mean by "with certainty c".

If "c" is supposed to represent a probability, then what event is it that must have probability "c"? What is the sample space in which such an event is defined?

For example if "c" is "the probability that K/N < p_max" then what is the sample space for the event "K/N < p_max"? Presumably it has to be a space where there are various possible values of K/N. How do you assign a probability distribution to the events in this space? (This leads to Bayesian statistical methods.)
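To make the Bayesian reading concrete, here is a minimal sketch that assumes a uniform prior over K = 0..N. That prior is purely an assumption for illustration (choosing it is exactly the modelling decision discussed above), and the function names are hypothetical.

```python
from math import comb

def posterior_K(N, n, k, prior=None):
    """Posterior P(K | k red in n draws) for K = 0..N.
    Uses a uniform prior unless another list of N+1 weights is supplied."""
    if prior is None:
        prior = [1.0] * (N + 1)            # ASSUMPTION: every K equally likely a priori
    denom = comb(N, n)
    weights = []
    for K in range(N + 1):
        if k <= K and n - k <= N - K:
            likelihood = comb(K, k) * comb(N - K, n - k) / denom
        else:
            likelihood = 0.0               # this K cannot produce the observed sample
        weights.append(prior[K] * likelihood)
    total = sum(weights)
    return [w / total for w in weights]

def prob_fraction_below(N, n, k, p_max, prior=None):
    """Posterior probability that K/N < p_max."""
    post = posterior_K(N, n, k, prior)
    return sum(p for K, p in enumerate(post) if K / N < p_max)

print(prob_fraction_below(N=1000, n=100, k=8, p_max=0.1))
```

Under this reading, "certainty c" would mean requiring the printed posterior probability to be at least c; a different prior gives a different answer.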

Does "c" represent statistical "confidence"? In the scenario for "confidence" there is some population parameter P and sample statistic used to estimate it. You are not asking about how to estimate the population parameter K/N, you are asking about estimating (in the sense of "yes" or "no") whether it satisfies a certain inequality. So we'll have to think carefully about how "confidence" relates to your question.
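One frequentist reading that fits the yes/no question is an acceptance test: approve the population only if k is so small that it would occur with probability at most 1 - c whenever K/N is at or above p_max. The sketch below assumes that reading; `K_bad` (the smallest red count that must be rejected, e.g. 100 for N=1000 and p_max=0.1) and the function names are illustrative, not from the thread.

```python
from math import comb

def cdf(N, K, n, k):
    """P(X <= k) for X ~ Hypergeometric(N, K, n); returns 0.0 for k < 0."""
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k + 1)) / comb(N, n)

def acceptance_number(N, K_bad, n, c):
    """Largest k for which the population may be approved, or -1 if none.
    P(X <= k | K) shrinks as K grows, so checking K = K_bad covers all worse cases."""
    alpha = 1.0 - c
    k = -1
    while k + 1 <= n and cdf(N, K_bad, n, k + 1) <= alpha:
        k += 1
    return k

def smallest_useful_n(N, K_bad, c):
    """Smallest sample size at which approval is possible at all (with k = 0)."""
    for n in range(1, N + 1):
        if acceptance_number(N, K_bad, n, c) >= 0:
            return n
    return None

# Example from the thread: N=1000, c=0.95, p_max=0.1, hence K_bad=100.
print(smallest_useful_n(1000, 100, 0.95))
print(acceptance_number(1000, 100, 100, 0.95))
```

Under this reading a sample is approved only when its observed k does not exceed the acceptance number for that n. Note that this controls the chance of wrongly approving a bad population; it says nothing about how often a good population gets scrapped, which depends on the true K.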
M_1 said:
I hope this is graduate level, at least it beats me :-)

As far as computations go, the problem might not be at the graduate level. But as far as conceptual understanding goes, it exceeds the level that typical undergraduates attain. To get a mathematical answer, you have to struggle with the question of "What am I really asking?".
 
