Sample size required in hypergeometric test


Discussion Overview

The discussion revolves around determining the sample size required in a hypergeometric test to ensure that the fraction of red balls in a population is below a specified threshold with a certain level of confidence. Participants explore the implications of different parameters and the relationship between sample size, population characteristics, and statistical confidence.

Discussion Character

  • Exploratory
  • Technical explanation
  • Conceptual clarification
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant notes that the required sample size depends on the unknown total number of red balls (K) in the population.
  • Examples are provided to illustrate how the number of red balls affects the confidence in estimating their proportion, suggesting that more samples are needed when the proportion is closer to the threshold.
  • Another participant raises questions about the meaning of "p-hacking" and its implications for statistical confidence when evaluating probabilities after each draw.
  • There is a request for a mathematical expression that relates sample size (n) to population size (N), confidence level (c), maximum k, and other parameters, but uncertainty remains about the existence of a closed form.
  • Concerns are expressed about the distinction between "certainty" and "probability," prompting a discussion on how to define the event for which the probability is calculated and the implications for Bayesian methods.
  • Participants discuss constructing confidence intervals using the hypergeometric distribution based on sample observations.

Areas of Agreement / Disagreement

Participants express differing views on the interpretation of "certainty" versus "probability" and how these concepts relate to the problem at hand. There is no consensus on the mathematical formulation of the problem or the implications of the parameters involved.

Contextual Notes

Limitations include the need for clear definitions of terms such as "certainty" and "confidence," as well as the unresolved nature of the mathematical relationships between the parameters discussed.

M_1
I have a hypergeometric distribution with:

N=total population of red and green balls, I know this
K=total number of red balls, I don't know this
n=sample size (number of investigated balls), I can choose this
k=number of investigated balls that are red, I don't know this

Red balls are a problem and I want to make sure, with a certainty c, that in the total population the fraction K/N of red balls is lower than a certain value, called p_max.

How big must n be? And for a given n, what is the maximum value of k for which I can still approve the total population N? (I mean, if I cannot guarantee with certainty c that K/N is below p_max, I have to scrap the entire population N.)

For example, N=1000, c=0.95, and p_max=0.1.

I hope this is graduate level, at least it beats me :-)
 
M_1 said:
How big must n be?
Depends on K.

As an example, in a box with 28 red balls and 72 green balls, it is hard to figure out (with some confidence) if the fraction of red balls is larger than 30%: you'll need to investigate most balls. If the box has 0 red balls and 100 green balls, you can stop after investigating just about 10 balls.
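
The "about 10 balls" figure can be checked with a short calculation. The sketch below is only one reading of the example: it assumes a sample size fixed in advance, a 5% threshold (`alpha`, not stated in the post), and it asks how many all-green draws are needed before a box with 30 red balls out of 100 becomes implausible. The function names are made up for illustration.

```python
from math import comb

def prob_no_red(N, K, n):
    """P(zero red balls in n draws without replacement) when K of N balls are red."""
    return comb(N - K, n) / comb(N, n)

def draws_to_rule_out(N, K_threshold, alpha=0.05):
    """Smallest n for which an all-green sample has probability <= alpha
    if the box actually contained K_threshold red balls."""
    for n in range(1, N + 1):
        if prob_no_red(N, K_threshold, n) <= alpha:
            return n
    return None  # only reached if K_threshold is 0

# Box of 100 balls; "30% red" corresponds to K_threshold = 30.
n_needed = draws_to_rule_out(N=100, K_threshold=30)
print(n_needed, prob_no_red(100, 30, n_needed))
```

Under these assumptions, 9 all-green draws are enough, which agrees with "just about 10". The 28-red case is much harder because the observed fraction sits right next to the threshold.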

You can re-evaluate probabilities (for p_max and for the observed fraction) after each drawn ball, but then requiring a given confidence level no longer works easily, because stopping as soon as the threshold is crossed amounts to p-hacking.
 
Many thanks!
1) What does "p-hacking" mean?
2) Assume I know K. Could your example with 28/72 and 0/100 balls be expressed so that n is given by a mathematical expression in terms of N, c, p, and maximum k, or similar? Or is there a reference to such an expression?
3) If N=1000 balls in total, I draw 100 balls of which 8 are red. What can I say about K/N? I mean in terms of K/N=0.08 +/- x, with some certainty c, for example 0.95. Almost like my original question but with a two-sided interval. Maybe this can be solved without re-evaluation after each drawn ball.
 
M_1 said:
1) What does "p-hacking" mean?
About 50 million Google hits...
It means looking for something falling below/above some arbitrary threshold until you find something.
M_1 said:
2) Assume I know K. Could your example with 28/72 and 0/100 balls be expressed so that n is given by a mathematical expression in terms of N, c, p, and maximum k, or similar?
I'm not sure if there is a closed form, but you can certainly analyze every case separately.
M_1 said:
3) If N=1000 balls in total, I draw 100 balls of which 8 are red. What can I say about K/N? I mean in terms of K/N=0.08 +/- x, with some certainty c, for example 0.95.
Construct confidence intervals via the hypergeometric distribution.
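
A minimal sketch of one way to do this, assuming the usual "test inversion" construction: keep every K for which the observed k is not in either tail of size (1 - c)/2. The equal split between the two tails and all function names are assumptions for illustration, not something specified in the thread.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k): k red in n draws without replacement, K red among N balls."""
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def hypergeom_cdf(k, N, K, n):
    """P(X <= k); the empty sum gives 0 when k < 0."""
    return sum(hypergeom_pmf(j, N, K, n) for j in range(0, k + 1))

def confidence_set_for_K(N, n, k, conf=0.95):
    """Smallest and largest K that are not rejected by either one-sided
    test at level (1 - conf) / 2, given k red balls in a sample of n."""
    alpha = 1.0 - conf
    accepted = []
    # K must be at least k, and at most N minus the green balls already seen.
    for K in range(k, N - (n - k) + 1):
        p_low = hypergeom_cdf(k, N, K, n)             # P(X <= k | K)
        p_high = 1.0 - hypergeom_cdf(k - 1, N, K, n)  # P(X >= k | K)
        if p_low > alpha / 2 and p_high > alpha / 2:
            accepted.append(K)
    return accepted[0], accepted[-1]

K_lo, K_hi = confidence_set_for_K(N=1000, n=100, k=8, conf=0.95)
print(K_lo / 1000, K_hi / 1000)  # interval for K/N
```

The resulting interval is not symmetric around 0.08, so it does not have the form 0.08 +/- x.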
 
M_1 said:
I want to make sure, with a certainty c, that in the total population the fraction K/N of red balls is lower than a certain value, called p_max.

"Certainty" and "probability" are two diametrically opposed concepts.

To have a mathematical problem, you must define what you mean by "with certainty c".

If "c" is supposed to represent a probability, then what event is it that must have probability "c"? What is the sample space in which such an event is defined?

For example if "c" is "the probability that K/N < p_max" then what is the sample space for the event "K/N < p_max"? Presumably it has to be a space where there are various possible values of K/N. How do you assign a probability distribution to the events in this space? (This leads to Bayesian statistical methods.)

Does "c" represent statistical "confidence"? In the scenario for "confidence" there is some population parameter P and sample statistic used to estimate it. You are not asking about how to estimate the population parameter K/N, you are asking about estimating (in the sense of "yes" or "no") whether it satisfies a certain inequality. So we'll have to think carefully about how "confidence" relates to your question.

M_1 said:
I hope this is graduate level, at least it beats me :-)

As far as computations go, the problem might not be at the graduate level. But as far as conceptual understanding goes, it exceeds the level that typical undergraduates attain. To get a mathematical answer, you have to struggle with the question of "What am I really asking?".
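
To make the "what am I really asking?" point concrete, here is a sketch under one possible frequentist reading, which is an assumption and not the only valid definition: approve the population when k <= k_max, where k_max is chosen so that a population that is just barely bad (K equal to the smallest integer with K/N >= p_max) would be approved with probability at most 1 - c. The function names are hypothetical.

```python
from math import comb, ceil

def hypergeom_cdf(k, N, K, n):
    """P(X <= k) for k red in n draws without replacement, K red among N."""
    total = 0
    for j in range(max(0, n - (N - K)), min(k, n, K) + 1):
        total += comb(K, j) * comb(N - K, n - j)
    return total / comb(N, n)

def max_acceptable_k(N, n, p_max, c):
    """Largest k_max with P(X <= k_max | K = K_bad) <= 1 - c, or None if
    even k = 0 would approve a borderline-bad population too often."""
    # Smallest K with K/N >= p_max; the small offset guards against float noise.
    K_bad = ceil(p_max * N - 1e-9)
    best = None
    for k in range(0, n + 1):
        if hypergeom_cdf(k, N, K_bad, n) <= 1.0 - c:
            best = k
        else:
            break  # the CDF only grows with k, so no larger k can qualify
    return best

def min_sample_size(N, p_max, c):
    """Smallest n for which at least the outcome k = 0 allows approval."""
    for n in range(1, N + 1):
        if max_acceptable_k(N, n, p_max, c) is not None:
            return n
    return None

print(min_sample_size(1000, 0.1, 0.95))
print(max_acceptable_k(1000, 100, 0.1, 0.95))
```

With N=1000, c=0.95 and p_max=0.1 this reading gives a minimum of n=29, and only if all 29 sampled balls are green. Larger n allows a larger k_max. Note that this controls the chance of wrongly approving a bad population; it says nothing about "the probability that K/N < p_max", which would need a prior on K.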
 