Sample size required in hypergeometric test

SUMMARY

The forum discussion centers on determining the sample size required for a hypergeometric test to ensure that the fraction of red balls (K/N) in a population is below a specified threshold (p_max) with a given confidence level (c). The participants explore the implications of varying the total population (N), the number of red balls (K), and the sample size (n). They emphasize the complexity of establishing confidence intervals and the challenges of p-hacking when re-evaluating probabilities after each draw. Key examples illustrate the need for larger sample sizes when the proportion of red balls is uncertain.

PREREQUISITES
  • Understanding of hypergeometric distribution
  • Familiarity with confidence intervals and statistical significance
  • Knowledge of p-hacking and its implications in statistical analysis
  • Basic proficiency in Bayesian statistical methods
NEXT STEPS
  • Study the mathematical formulation of the hypergeometric distribution
  • Learn how to construct confidence intervals using hypergeometric tests
  • Research the concept of p-hacking and its impact on statistical validity
  • Explore Bayesian methods for estimating population parameters
USEFUL FOR

Statisticians, data scientists, graduate students in statistics, and anyone involved in hypothesis testing and sample size determination in research.

M_1
I have a hypergeometric distribution with:

N=total population of red and green balls, I know this
K=total number of red balls, I don't know this
n=sample size (number of investigated balls), I can choose this
k=number of investigated balls that are red, I don't know this

Red balls are a problem and I want to make sure, with a certainty c, that in the total population the fraction K/N of red balls is lower than a certain value, called p_max.

How big must n be? And for a given n, what is the maximum value of k for which I can approve the total population N? (I mean, if I cannot guarantee with certainty c that K/N is below p_max, I have to scrap the entire population N.)

For example, N=1000, c=0.95, and p_max=0.1.

I hope this is graduate level, at least it beats me :-)
 
M_1 said:
How big must n be?
Depends on K.

As an example, in a box with 28 red balls and 72 green balls, it is hard to figure out (with some confidence) if the fraction of red balls is larger than 30%: you'll need to investigate most balls. If the box has 0 red balls and 100 green balls, you can stop after investigating just about 10 balls.

You can re-evaluate probabilities (for p_max and for the observed fraction) after each drawn ball, but then requiring a fixed confidence level no longer works straightforwardly, because checking after every draw amounts to p-hacking.
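The 0-red case above can be checked numerically. The following is only a sketch under assumptions not stated in the post: the box is tested against the boundary case of 30 red balls out of 100, the sample size is fixed in advance (no re-evaluation after each draw), and 95% is used as the confidence level. The function names are made up for illustration.

```python
from math import comb

def prob_zero_red(N, K, n):
    """P(no red ball among n draws without replacement) when K of N balls are red."""
    return comb(N - K, n) / comb(N, n)

def min_draws_all_green(N, K_bad, alpha):
    """Smallest n for which an all-green sample has probability <= alpha,
    assuming the box really contains K_bad red balls (assumed boundary case)."""
    for n in range(N + 1):
        if prob_zero_red(N, K_bad, n) <= alpha:
            return n
    return None  # only reached if K_bad == 0 and alpha < 1

# Assumed numbers: N = 100 balls, boundary case 30 red, alpha = 1 - 0.95.
n_needed = min_draws_all_green(100, 30, 0.05)
print(n_needed)
```

With these assumed numbers the result is 9 draws, which matches the "about 10 balls" estimate. If any red ball turns up among those draws, this simple rule no longer applies and a larger sample with a nonzero acceptance number is needed.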
 
mfb said:
Depends on K.

Many thanks!
1) What does "p-hacking" mean?
2) Assume I know K. Could your example with 28/72 and 0/100 balls be expressed so that n is given by a mathematical expression in terms of N, c, p, and maximum k, or similar? Or is there a reference to such an expression?
3) If N=1000 balls in total, I draw 100 balls of which 8 are red. What can I say about K/N? I mean in terms of K/N=0.08 +/- x, with some certainty c, for example 0.95. Almost like my original question but with a two-sided interval. Maybe this can be solved without re-evaluation after each drawn ball.
 
M_1 said:
1) What does "p-hacking" mean?
About 50 million Google hits...
It means looking for something falling below/above some arbitrary threshold until you find something.
M_1 said:
2) Assume I know K. Could your example with 28/72 and 0/100 balls be expressed so n is given in a mathematical expression in terms of N, c, p, and maximum k, or similar.
I'm not sure if there is a closed form, but you can certainly analyze every case separately.
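One way to analyze each case separately is to fix n and search for the largest acceptance number. This is a sketch that assumes "certainty c" is read in the acceptance-sampling sense: a population with K/N >= p_max should be wrongly approved with probability at most 1 - c. That reading, the function names, and the explicit K_bad parameter are assumptions for illustration, not something stated in the thread.

```python
from math import comb

def hypergeom_cdf(a, N, K, n):
    """P(X <= a), where X is the number of red balls in n draws without replacement."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(a + 1)) / total

def max_accept_number(N, n, K_bad, c):
    """Largest a such that the rule 'approve if k <= a' approves a population
    with K_bad red balls with probability <= 1 - c. Returns -1 if no a works.

    K_bad is the smallest K that violates K/N < p_max. Populations with more
    red balls are approved even less often, so K_bad is the worst case."""
    a = -1
    while a + 1 <= n and hypergeom_cdf(a + 1, N, K_bad, n) <= 1 - c:
        a += 1
    return a

# Assumed numbers from the original question: N = 1000, c = 0.95, p_max = 0.1,
# so K_bad = 100. Scan a few sample sizes.
for n in (10, 50, 100, 200, 500):
    print(n, max_accept_number(1000, n, 100, 0.95))
```

A result of -1 means that the sample is too small to approve anything at that confidence, even with zero red balls observed. Scanning n upward until the result reaches 0 gives the smallest usable sample size under this reading.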
M_1 said:
3) If N=1000 balls in total, I draw 100 balls of which 8 are red. What can I say about K/N? I mean in terms of K/N=0.08 +/- x, with some certainty c, for example 0.95.
Construct confidence intervals via the hypergeometric distribution.
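A minimal sketch of such a construction, assuming the usual test-inversion approach: keep every K for which the observed k is not in either tail of probability (1 - c)/2. The function names are hypothetical, and other exact interval constructions exist and can give slightly different end points.

```python
from math import comb

def tail_probs(N, K, n, k):
    """Return (P(X <= k), P(X >= k)) for X ~ Hypergeometric(N, K, n)."""
    total = comb(N, n)
    below = sum(comb(K, i) * comb(N - K, n - i) for i in range(k))  # count for X <= k-1
    at_k = comb(K, k) * comb(N - K, n - k)
    return (below + at_k) / total, (total - below) / total

def confidence_interval_K(N, n, k, c):
    """Smallest and largest K that are not rejected by either one-sided test
    at level (1 - c) / 2, given k red balls observed in n draws."""
    alpha = 1 - c
    kept = []
    # K must be at least k, and must leave room for the n - k green balls seen.
    for K in range(k, N - (n - k) + 1):
        p_low, p_high = tail_probs(N, K, n, k)
        if p_low > alpha / 2 and p_high > alpha / 2:
            kept.append(K)
    return kept[0], kept[-1]

# Numbers from question 3: N = 1000, n = 100, k = 8, c = 0.95.
K_lo, K_hi = confidence_interval_K(1000, 100, 8, 0.95)
print(K_lo / 1000, K_hi / 1000)
```

The interval is generally not symmetric around 0.08, so it does not take the form 0.08 +/- x.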
 
M_1 said:
I want to make sure, with a certainty c, that in the total population the fraction K/N of red balls is lower than a certain value, called p_max.

"Certainty" and "probability" are two diametrically opposed concepts.

To have a mathematical problem, you must define what you mean by "with certainty c".

If "c" is supposed to represent a probability, then what event is it that must have probability "c"? What is the sample space in which such an event is defined?

For example if "c" is "the probability that K/N < p_max" then what is the sample space for the event "K/N < p_max"? Presumably it has to be a space where there are various possible values of K/N. How do you assign a probability distribution to the events in this space? (This leads to Bayesian statistical methods.)
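To make the Bayesian reading concrete, here is a sketch that assumes a uniform prior over K = 0, ..., N. The prior is purely an illustrative assumption; choosing it is exactly the modelling step the question above asks about, and a different prior gives a different answer. The function name and the integer K_limit parameter are hypothetical.

```python
from math import comb

def posterior_prob_below(N, n, k, K_limit):
    """P(K < K_limit | k red in n draws), assuming a uniform prior on K = 0..N.

    With a flat prior the posterior weight of each K is proportional to the
    hypergeometric likelihood C(K, k) * C(N - K, n - k)."""
    weights = [comb(K, k) * comb(N - K, n - k) for K in range(N + 1)]
    return sum(weights[:K_limit]) / sum(weights)

# Assumed numbers: N = 1000, sample of n = 100 with k = 8 red,
# and p_max = 0.1, so the event K/N < p_max is K < 100.
print(posterior_prob_below(1000, 100, 8, 100))
```

Under this reading, the population would be approved when the printed probability is at least c.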

Does "c" represent statistical "confidence"? In the scenario for "confidence" there is some population parameter P and a sample statistic used to estimate it. You are not asking about how to estimate the population parameter K/N, you are asking about estimating (in the sense of "yes" or "no") whether it satisfies a certain inequality. So we'll have to think carefully about how "confidence" relates to your question.
M_1 said:
I hope this is graduate level, at least it beats me :-)

As far as computations go, the problem might not be on the graduate level. But as far as conceptual understanding goes, it exceeds the level that typical undergraduates attain. To get a mathematical answer, you have to struggle with the question of "What am I really asking?".
 