Sample size required in hypergeometric test

In summary: It is not clear what is meant by "a certain value, called p_max." In the statement of the problem, "p_max" is a fixed number. But later in the conversation, p_max is treated as a random variable, so one has to decide which of the two it is. The conversation discusses a hypergeometric distribution with variables N, K, n, and k. The problem is to determine the sample size n and the maximum value of k in order to have a certain level of certainty, c, that the fraction of red balls in the total population, K/N, is lower than a given value, p_max.
  • #1
M_1
I have a hypergeometric distribution with:

N=total population of red and green balls, I know this
K=total number of red balls, I don't know this
n=sample size (number of investigated balls), I can choose this
k=number of investigated balls that are red, I don't know this

Red balls are a problem and I want to make sure, with a certainty c, that in the total population the fraction K/N of red balls is lower than a certain value, called p_max.

How big must n be? And for a given n, what is the maximum value of k for which I can still approve the total population N? (I mean, if I cannot guarantee with certainty c that K/N is below p_max, I have to scrap the entire population N.)

For example, N=1000, c=0.95, and p_max=0.1.

I hope this is graduate level, at least it beats me :-)
 
  • #2
M_1 said:
How big must n be?
Depends on K.

As an example, in a box with 28 red balls and 72 green balls, it is hard to figure out (with some confidence) if the fraction of red balls is larger than 30%: you'll need to investigate most balls. If the box has 0 red balls and 100 green balls, you can stop after investigating just about 10 balls.
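A minimal sketch of the arithmetic behind this example, under assumptions that are not in the post: "some confidence" is read as a one-sided 95% level, the box holds N=100 balls, and "fewer than 30% red" is accepted only if a sample at least as green would have probability at most 5% in the borderline box with exactly 30 red balls. The helper names are illustrative.

```python
from math import comb

def hypergeom_cdf(k, N, K, n):
    """P(X <= k) when n balls are drawn without replacement from N balls, K of them red."""
    if k < 0:
        return 0.0
    top = min(k, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(top + 1)) / comb(N, n)

def max_accept(N, K_limit, n, alpha):
    """Largest k with P(X <= k | K = K_limit) <= alpha, or -1 if even k = 0 is too likely."""
    k = -1
    while k + 1 <= n and hypergeom_cdf(k + 1, N, K_limit, n) <= alpha:
        k += 1
    return k

N, K_LIMIT, ALPHA = 100, 30, 0.05  # assumed box size, borderline red count, error level

# Box with 0 red balls: every sample is all green, so look for the smallest n
# at which an all-green sample is implausible in the borderline box.
n_zero_red = next(n for n in range(1, N + 1) if max_accept(N, K_LIMIT, n, ALPHA) >= 0)

def accept_prob(n, K_true=28):
    """Chance that the rule accepts when the box really holds K_true red balls."""
    return hypergeom_cdf(max_accept(N, K_LIMIT, n, ALPHA), N, K_true, n)

# Box with 28 red balls: smallest n at which the rule accepts at least half the time.
n_28_red = next(n for n in range(1, N + 1) if accept_prob(n) >= 0.5)

print(n_zero_red, n_28_red)
```

Under these assumptions the all-green box needs 9 draws, in line with "about 10", while the box with 28 red balls needs well over half of the box before the rule accepts even half of the time.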

You can re-evaluate probabilities (for p_max and for the observed fraction) after each drawn ball, but then requiring some given confidence level does not work easily, because you are effectively p-hacking.
 
  • #3
mfb said:
Depends on K.

As an example, in a box with 28 red balls and 72 green balls, it is hard to figure out (with some confidence) if the fraction of red balls is larger than 30%: you'll need to investigate most balls. If the box has 0 red balls and 100 green balls, you can stop after investigating just about 10 balls.

You can re-evaluate probabilities (for p_max and for the observed fraction) after each drawn ball, but then requiring some given confidence level does not work easily, because you are effectively p-hacking.
Many thanks!
1) What does "p-hacking" mean?
2) Assume I know K. Could your example with 28/72 and 0/100 balls be expressed so that n is given by a mathematical expression in terms of N, c, p, and maximum k, or similar? Or is there a reference to such an expression?
3) If N=1000 balls in total, I draw 100 balls of which 8 are red. What can I say about K/N? I mean in terms of K/N=0.08 +/- x, with some certainty c, for example 0.95. Almost like my original question but a double-sided interval. Maybe this can be solved without re-evaluation after each drawn ball.
 
  • #4
M_1 said:
1) What does "p-hacking" mean?
About 50 million Google hits...
It means looking for something falling below/above some arbitrary threshold until you find something.
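A minimal simulation sketch of why re-testing after every drawn ball inflates the error rate. Everything here is an assumption made for illustration: a box of N=100 balls with exactly 30 red (so "fewer than 30% red" is false), a nominal 5% error level, a fixed-sample test at n=50, and the helper names.

```python
import random
from math import comb

def hypergeom_cdf(k, N, K, n):
    """P(X <= k) when n balls are drawn without replacement from N balls, K of them red."""
    if k < 0:
        return 0.0
    top = min(k, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(top + 1)) / comb(N, n)

N, K_LIMIT, ALPHA = 100, 30, 0.05  # assumed values, not from the thread

def threshold(n):
    """Largest k with P(X <= k | K = K_LIMIT) <= ALPHA after n draws, or -1."""
    k = -1
    while k + 1 <= n and hypergeom_cdf(k + 1, N, K_LIMIT, n) <= ALPHA:
        k += 1
    return k

K_MAX = [threshold(n) for n in range(N + 1)]  # index = number of balls drawn so far

def simulate(trials=2000, fixed_n=50, seed=1):
    """Return (false-acceptance rate with one look at fixed_n, rate with a look after every draw)."""
    rng = random.Random(seed)
    box = [1] * K_LIMIT + [0] * (N - K_LIMIT)  # 1 = red, 0 = green
    fixed_hits = peek_hits = 0
    for _ in range(trials):
        rng.shuffle(box)
        reds, ever_accepted = 0, False
        for n, ball in enumerate(box, start=1):
            reds += ball
            accepted = reds <= K_MAX[n]
            ever_accepted = ever_accepted or accepted
            if n == fixed_n and accepted:
                fixed_hits += 1
        peek_hits += ever_accepted
    return fixed_hits / trials, peek_hits / trials

fixed_rate, peek_rate = simulate()
print(fixed_rate, peek_rate)
```

The second rate can never be smaller than the first, because the single look at n=50 is one of the many looks. How far it rises above the nominal 5% depends on the setup, so the printed numbers are best read as an illustration only.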
M_1 said:
2) Assume I know K. Could your example with 28/72 and 0/100 balls be expressed so that n is given by a mathematical expression in terms of N, c, p, and maximum k, or similar?
I'm not sure if there is a closed form, but you can certainly analyze every case separately.
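Analyzing each case separately can be done by brute force. The sketch below uses the numbers from the first post (N=1000, c=0.95, p_max=0.1) and assumes one particular reading of "certainty c" that the thread does not settle: the population is approved only if a sample with k or fewer red balls would have probability at most 1-c in the borderline population with K = p_max*N = 100 red balls. The helper names are illustrative.

```python
from math import comb

def hypergeom_cdf(k, N, K, n):
    """P(X <= k) when n balls are drawn without replacement from N balls, K of them red."""
    if k < 0:
        return 0.0
    top = min(k, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(top + 1)) / comb(N, n)

def max_accept(N, K_limit, n, alpha):
    """Largest k with P(X <= k | K = K_limit) <= alpha, or -1 if no k qualifies."""
    k = -1
    while k + 1 <= n and hypergeom_cdf(k + 1, N, K_limit, n) <= alpha:
        k += 1
    return k

N, C, K_LIMIT = 1000, 0.95, 100  # K_LIMIT = p_max * N with p_max = 0.1

# Maximum number of red balls that still allows approval, for a few sample sizes.
plan = {n: max_accept(N, K_LIMIT, n, 1 - C) for n in (10, 30, 50, 100, 200, 500)}

# Smallest sample size at which approval is possible at all (with zero red balls).
n_min = next(n for n in range(1, N + 1) if max_accept(N, K_LIMIT, n, 1 - C) >= 0)

for n, k_max in plan.items():
    print(f"n={n}: approve if k <= {k_max}" if k_max >= 0 else f"n={n}: cannot approve")
print("smallest usable n:", n_min)
```

A larger n allows a larger k, but it also costs more inspection, so the choice of n is a trade-off that the test itself does not fix.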
M_1 said:
3) If N=1000 balls in total, I draw 100 balls of which 8 are red. What can I say about K/N? I mean in terms of K/N=0.08 +/- x, with some certainty c, for example 0.95.
Construct confidence intervals via the hypergeometric distribution.
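One way to do this is to invert two one-sided hypergeometric tests, in the spirit of an exact (Clopper-Pearson style) interval: keep every K for which the observed k is neither in the lowest nor in the highest 2.5% tail. This is a sketch of that one construction, not the only possible interval, and the function names are illustrative.

```python
from math import comb

def tail_probs(N, K, n, k):
    """Return (P(X <= k), P(X >= k)) for n draws without replacement, K red among N."""
    total = comb(N, n)
    terms = [comb(K, x) * comb(N - K, n - x) for x in range(k + 1)]
    at_most_k = sum(terms) / total
    at_least_k = 1.0 - sum(terms[:-1]) / total
    return at_most_k, at_least_k

def exact_interval(N, n, k, conf=0.95):
    """Smallest and largest K not rejected by either one-sided test at level (1 - conf) / 2."""
    alpha = 1.0 - conf
    kept = []
    # Only K values compatible with the sample: at least k red, at least n - k green.
    for K in range(k, N - (n - k) + 1):
        at_most_k, at_least_k = tail_probs(N, K, n, k)
        if at_most_k > alpha / 2 and at_least_k > alpha / 2:
            kept.append(K)
    return kept[0], kept[-1]

K_lo, K_hi = exact_interval(N=1000, n=100, k=8)
print(f"{K_lo / 1000:.3f} <= K/N <= {K_hi / 1000:.3f}")
```

The resulting interval is not symmetric around 0.08, so it does not take the form 0.08 +/- x.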
 
  • #5
M_1 said:
I want to make sure, with a certainty c, that in the total population the fraction K/N of red balls is lower than a certain value, called p_max.

"Certainty" and "probability" are two diametrically opposed concepts.

To have a mathematical problem, you must define what you mean by "with certainty c".

If "c" is supposed to represent a probability, then what event is it that must have probability "c"? What is the sample space in which such an event is defined?

For example, if "c" is "the probability that K/N < p_max", then what is the sample space for the event "K/N < p_max"? Presumably it has to be a space where there are various possible values of K/N. How do you assign a probability distribution to the events in this space? (This leads to Bayesian statistical methods.)
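A minimal sketch of the Bayesian route, with one loud assumption: before sampling, every value K = 0, 1, ..., N is taken to be equally likely. That uniform prior is a choice made here for illustration, and a different prior gives a different answer. The data are the numbers from earlier in the thread (N=1000, n=100, k=8, p_max=0.1), and the function name is hypothetical.

```python
from math import comb

def posterior_K(N, n, k, prior=None):
    """Posterior P(K | k red in a sample of n) for K = 0..N; uniform prior by default."""
    if prior is None:
        prior = [1.0] * (N + 1)  # assumed prior: all K equally likely
    total = comb(N, n)
    # Hypergeometric likelihood of the observed sample for each candidate K.
    likelihood = [comb(K, k) * comb(N - K, n - k) / total for K in range(N + 1)]
    weights = [p * l for p, l in zip(prior, likelihood)]
    norm = sum(weights)
    return [w / norm for w in weights]

N, n, k, K_LIMIT = 1000, 100, 8, 100  # K_LIMIT = p_max * N with p_max = 0.1

post = posterior_K(N, n, k)
prob_below_limit = sum(post[:K_LIMIT])  # P(K < 100 | data) = P(K/N < p_max | data)
print(round(prob_below_limit, 3))
```

In this setting the requirement "with certainty c" would become the condition `prob_below_limit >= c`, which is a statement about K given the data and the prior, not about repeated sampling.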

Does "c" represent statistical "confidence"? In the scenario for "confidence" there is some population parameter P and a sample statistic used to estimate it. You are not asking about how to estimate the population parameter K/N, you are asking about estimating (in the sense of "yes" or "no") whether it satisfies a certain inequality. So we'll have to think carefully about how "confidence" relates to your question.
M_1 said:
I hope this is graduate level, at least it beats me :-)

As far as computations go, the problem might not be on the graduate level. But, as far as conceptual understanding goes, it exceeds the level that typical undergraduates attain. To get a mathematical answer, you have to struggle with the question of "What am I really asking?".
 

1. What is a hypergeometric test?

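A hypergeometric test is a statistical test based on the hypergeometric distribution, which describes sampling without replacement from a finite population. It assesses whether a category is over- or under-represented in a sample relative to the population. It is commonly used in genetics and epidemiology to analyze the relationship between two categorical variables.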

2. How is sample size determined in a hypergeometric test?

The sample size required for a hypergeometric test depends on the number of individuals in the population, the number of individuals with the specific characteristic or trait being studied, and the desired level of statistical significance. Generally, a larger sample size is needed for a more accurate and reliable result.

3. What factors affect the required sample size in a hypergeometric test?

The required sample size in a hypergeometric test is influenced by the desired level of statistical power, the desired level of statistical significance, and the expected effect size. Additionally, the variability of the data and the population size can also impact the required sample size.

4. Is there a formula for calculating the required sample size in a hypergeometric test?

A simple closed-form formula is generally not available. The required sample size follows from the hypergeometric distribution and the factors mentioned in the previous questions, and it is usually found numerically, so in practice statistical software or online calculators are used to determine it.

5. Why is it important to determine the appropriate sample size in a hypergeometric test?

Determining the appropriate sample size in a hypergeometric test is crucial because it directly affects the power and accuracy of the test. An insufficient sample size may lead to a lack of statistical power, making it difficult to detect a true relationship between the variables being studied. On the other hand, an overly large sample size may be wasteful and unnecessary. Therefore, it is important to carefully consider and determine the appropriate sample size for a hypergeometric test to obtain reliable and meaningful results.
