# A Sample size required in hypergeometric test

1. Oct 25, 2016

### M_1

I have a hypergeometric distribution with:

N=total population of red and green balls, I know this
K=total number of red balls, I don't know this
n=sample size (number of investigated balls), I can choose this
k=number of investigated balls that are red, I don't know this

Red balls are a problem and I want to make sure, with a certainty c, that in the total population the fraction K/N of red balls is lower than a certain value, called p_max.

How big must n be? And for a given n, what is the maximum value of k for which I can still approve the total population N? (I mean, if I cannot guarantee with certainty c that K/N is below p_max, I have to scrap the entire population N.)

For example, N=1000, c=0.95, and p_max=0.1.
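One way to make these two questions concrete is the classical acceptance-sampling reading, which is an assumption and not something stated in the post: take the smallest unacceptable number of red balls (here `K_bad = 100`, the smallest K with K/N >= p_max) as the worst case, and approve the population only if the observed k is so small that it would occur with probability at most 1 - c under `K_bad`. The sketch below uses only the Python standard library; the function names are hypothetical.

```python
from math import comb

def hypergeom_cdf(k, N, K, n):
    """P(at most k reds) when n balls are drawn without replacement
    from N balls of which K are red."""
    if k < 0:
        return 0.0
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(min(k, n) + 1)) / total

def max_accepted_reds(N, K_bad, n, c):
    """Largest k for which 'k or fewer reds in n draws' has probability
    at most 1 - c when K == K_bad. Returns None if even k = 0 is too likely."""
    alpha = 1 - c
    k_max = None
    for k in range(n + 1):
        if hypergeom_cdf(k, N, K_bad, n) <= alpha:
            k_max = k
        else:
            break  # the CDF only grows with k, so no larger k can qualify
    return k_max

def min_sample_size(N, K_bad, c):
    """Smallest n for which a sample containing zero reds is already convincing."""
    for n in range(1, N + 1):
        if hypergeom_cdf(0, N, K_bad, n) <= 1 - c:
            return n
    return None

# Values from the post: N = 1000, c = 0.95, p_max = 0.1, hence K_bad = 100.
print("smallest useful n:", min_sample_size(1000, 100, 0.95))
print("max k to approve when n = 100:", max_accepted_reds(1000, 100, 100, 0.95))
```

Because P(X <= k) decreases as K grows, a limit that holds for `K_bad` also holds for every larger K, which is why only the boundary case is checked.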

I hope this is graduate level, at least it beats me :-)

2. Oct 25, 2016

### Staff: Mentor

Depends on K.

As an example, in a box with 28 red balls and 72 green balls, it is hard to figure out (with some confidence) if the fraction of red balls is larger than 30%: you'll need to investigate most balls. If the box has 0 red balls and 100 green balls, you can stop after investigating just about 10 balls.
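The contrast between the two boxes can be sketched numerically. The code below is an illustration under assumptions that are not in the thread: a one-sided test of "K >= 30" at level 0.05, and a hypothetical target of a 90% chance that the test succeeds for the box actually at hand. Function names are made up for the example.

```python
from math import comb

def cdf(k, N, K, n):
    """P(at most k reds) in n draws without replacement from N balls, K red."""
    if k < 0:
        return 0.0
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(min(k, n) + 1)) / comb(N, n)

def critical_k(N, K_bad, n, alpha):
    """Largest k with P(X <= k | K = K_bad) <= alpha, or -1 if there is none."""
    k = -1
    while k < n and cdf(k + 1, N, K_bad, n) <= alpha:
        k += 1
    return k

def draws_needed(N, K_bad, K_true, alpha=0.05, power=0.9):
    """Smallest n at which a box with K_true reds passes the test
    'fewer than K_bad reds' with probability at least `power`."""
    for n in range(1, N + 1):
        k = critical_k(N, K_bad, n, alpha)
        if cdf(k, N, K_true, n) >= power:
            return n
    return None

# The two boxes from the post: 100 balls, threshold 30% red.
print("0 red / 100 green :", draws_needed(100, 30, 0))
print("28 red / 72 green :", draws_needed(100, 30, 28))
```

The 90% target is arbitrary; a different target shifts the numbers but not the qualitative gap between the two boxes.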

You can re-evaluate probabilities (for p_max and for the observed fraction) after each drawn ball, but then a required confidence level is no longer straightforward to guarantee: testing again and again until a threshold is crossed amounts to p-hacking.
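To see why repeated re-evaluation is a problem, the following sketch computes exactly how often a box sitting right at the limit (30 reds in 100 balls) would be wrongly declared fine if the same 5% test were applied after every draw and sampling stopped at the first "success". The stopping rule and the function name are assumptions made for illustration.

```python
from math import comb

def cdf(k, N, K, n):
    """P(at most k reds) in n draws without replacement from N balls, K red."""
    if k < 0:
        return 0.0
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(min(k, n) + 1)) / comb(N, n)

def false_claim_rate_with_peeking(N, K_bad, alpha=0.05):
    """Exact chance of ever concluding 'K < K_bad' although K == K_bad,
    when the test is repeated after every single draw."""
    alive = {0: 1.0}   # reds seen so far -> probability of being here with no claim yet
    claimed = 0.0
    for n in range(1, N + 1):
        reached = {}
        for k, p in alive.items():
            p_red = (K_bad - k) / (N - (n - 1))   # reds left / balls left
            for k_next, q in ((k + 1, p * p_red), (k, p * (1.0 - p_red))):
                if q > 0.0:
                    reached[k_next] = reached.get(k_next, 0.0) + q
        alive = {}
        for k, p in reached.items():
            if cdf(k, N, K_bad, n) <= alpha:
                claimed += p       # threshold crossed: stop and (wrongly) approve
            else:
                alive[k] = p
    return claimed

print("nominal error rate per look:", 0.05)
print("error rate with a look after every draw:", false_claim_rate_with_peeking(100, 30))
```

Each individual look keeps its error below 5%, but the chance that at least one of many looks crosses the threshold is larger, so the overall guarantee is lost.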

3. Oct 25, 2016

### M_1

Many thanks!
1) What does "p-hacking" mean?
2) Assume I know K. Could your example with 28/72 and 0/100 balls be expressed so that n is given by a mathematical expression in terms of N, c, p, and maximum k, or similar? Or is there a reference to such an expression?
3) If N=1000 balls in total, I draw 100 balls of which 8 are red. What can I say about K/N? I mean in terms of K/N=0.08 +/- x, with some certainty c, for example 0.95. Almost like my original question but a double-sided interval. Maybe this can be solved without re-evaluation after each drawn ball.

4. Oct 26, 2016

### Staff: Mentor

1) It means looking for something falling below/above some arbitrary threshold until you find something.
2) I'm not sure if there is a closed form, but you can certainly analyze every case separately.
3) Construct confidence intervals via the hypergeometric distribution.
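A minimal sketch of such an interval for the numbers in the question (N=1000, n=100, k=8), assuming the usual test-inversion construction with equal tails: a value of K stays in the interval as long as neither "8 or fewer reds" nor "8 or more reds" is less likely than (1 - c)/2 under that K. The function names are hypothetical and only the standard library is used.

```python
from math import comb

def cdf(k, N, K, n):
    """P(at most k reds) in n draws without replacement from N balls, K red."""
    if k < 0:
        return 0.0
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(min(k, n) + 1)) / comb(N, n)

def interval_for_K(N, n, k, c=0.95):
    """Two-sided confidence interval for K, obtained by keeping every K
    that is not rejected by either one-sided test at level (1 - c) / 2."""
    alpha = 1 - c
    plausible = []
    # K must be at least k, and must leave enough green balls for the sample.
    for K in range(k, N - (n - k) + 1):
        p_low = cdf(k, N, K, n)               # P(X <= k | K)
        p_high = 1.0 - cdf(k - 1, N, K, n)    # P(X >= k | K)
        if p_low > alpha / 2 and p_high > alpha / 2:
            plausible.append(K)
    return plausible[0], plausible[-1]

lo, hi = interval_for_K(1000, 100, 8)
print(f"K between {lo} and {hi}, i.e. K/N between {lo / 1000:.3f} and {hi / 1000:.3f}")
```

The interval is generally not symmetric around 0.08, so it does not take the form 0.08 +/- x exactly.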

5. Oct 26, 2016

### Stephen Tashi

"Certainty" and "probability" are two diametrically opposed concepts.

To have a mathematical problem, you must define what you mean by "with certainty c".

If "c" is supposed to represent a probability, then what event is it that must have probability "c"? What is the sample space in which such an event is defined?

For example if "c" is "the probability that K/N < p_max" then what is the sample space for the event "K/N < p_max"? Presumably it has to be a space where there are various possible values of K/N. How do you assign a probability distribution to the events in this space? (This leads to Bayesian statistical methods.)
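As an illustration of the Bayesian reading, the sketch below assigns a prior distribution to K and computes the posterior probability that K/N < p_max after seeing k reds in n draws. The flat prior (every K from 0 to N equally likely) is purely an assumption for the example, as are the function name and the use of the numbers from earlier in the thread; a different prior gives a different answer.

```python
from math import comb

def posterior_prob_below(N, n, k, K_bad, prior=None):
    """P(K < K_bad | k reds observed in n draws), given a prior over K = 0..N."""
    if prior is None:
        prior = [1.0] * (N + 1)    # ASSUMPTION: flat prior over all possible K
    total = comb(N, n)
    # Unnormalised posterior weight = prior * hypergeometric likelihood of the sample.
    weights = [
        prior[K] * (comb(K, k) * comb(N - K, n - k) / total)
        for K in range(N + 1)
    ]
    return sum(weights[:K_bad]) / sum(weights)

# N = 1000, p_max = 0.1 (so K_bad = 100), sample of 100 balls with 8 reds.
print("P(K/N < 0.1 | 8 of 100 red):", posterior_prob_below(1000, 100, 8, 100))
print("P(K/N < 0.1 | 0 of 100 red):", posterior_prob_below(1000, 100, 0, 100))
```

Here "c" would be a statement about this posterior probability, which is a different quantity from the confidence level of a frequentist test.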

Does "c" represent statistical "confidence"? In the scenario for "confidence" there is some population parameter P and a sample statistic used to estimate it. You are not asking about how to estimate the population parameter K/N; you are asking about estimating (in the sense of "yes" or "no") whether it satisfies a certain inequality. So we'll have to think carefully about how "confidence" relates to your question.

As far as computations go, the problem might not be on the graduate level. But as far as conceptual understanding goes, it exceeds the level that typical undergraduates attain. To get a mathematical answer, you have to struggle with the question of "What am I really asking?".