Calculating probability that random subset of population contains duplicates

  • Context: MHB
  • Thread starter: mads1
  • Tags: population, random
SUMMARY

The discussion concerns calculating the expected number of duplicates in a sample drawn at random from a population, motivated by biological data. The poster wants to compute this expectation for a population of 3 million and various sample sizes. One reply points to the hypergeometric distribution, which models sampling without replacement; another asks what counts as a duplicate and whether the sampling was with or without replacement, since the answer depends on both. The poster has access to R and is willing to use any relevant libraries.

PREREQUISITES
  • Understanding of hypergeometric distribution
  • Familiarity with R programming language
  • Knowledge of statistical sampling methods
  • Basic concepts of probability theory
NEXT STEPS
  • Research the hypergeometric distribution and its applications in statistics
  • Learn how to carry out these calculations in R, for example with the base 'stats' functions 'dhyper' and 'phyper'
  • Explore sampling methods, specifically the differences between sampling with and without replacement
  • Investigate how to visualize sampling distributions in R
USEFUL FOR

Statisticians, data scientists, biologists, and anyone involved in sampling and analyzing biological data who needs to understand the implications of duplicates in their samples.

mads1
Hi,

Apologies that this is a basic question but I have to start somewhere! (-:

The problem is succinctly stated in the msg title but, in greater detail: I'm working with some biological data from which samples have been taken. The sampling should have been at random. The samples include duplicates. What I need to know is how to calculate the expected number of duplicates in a sample of a given size drawn from a population of a given size.

For example, if I have a population size, p, of 3 million, and take 3 million samples, s, then the extent of duplicates within the samples s would be expected to be greater than if I take 300 thousand samples.

But how do I calculate the expected rate given various values of p and s?
I have access to R & should be able to find my way to any libraries which might be helpful in answering this. Thanks

m
 
If I understand the problem correctly, then I think you should take a look at the hypergeometric distribution (use your preferred search engine).
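
As a rough illustration of what the hypergeometric distribution computes, here is a minimal Python sketch (Python is used only for illustration; in R the built-in `dhyper` function plays the same role). It assumes one particular reading of the problem: the population of N individuals contains K of some type of interest, n individuals are drawn without replacement, and we want the probability that exactly k of the drawn individuals are of that type. The function names and the value K = 1000 are hypothetical, chosen only for the example.

```python
from math import exp, lgamma


def log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """P(exactly k of the n draws are of the type of interest).

    N: population size, K: individuals of that type in the population,
    n: sample size, drawn without replacement.
    """
    # Outside the support the probability is zero.
    if k < max(0, n - (N - K)) or k > min(K, n):
        return 0.0
    return exp(log_comb(K, k) + log_comb(N - K, n - k) - log_comb(N, n))


# Hypothetical numbers: 3 million individuals, 1000 of the type of interest,
# 300 thousand drawn without replacement.
N, K, n = 3_000_000, 1_000, 300_000
print("expected count in sample:", n * K / N)
print("P(exactly 100 in sample):", hypergeom_pmf(100, N, K, n))
```

Log-gamma is used instead of exact binomial coefficients because the coefficients for a population of 3 million are astronomically large. Whether this model fits depends on what "duplicate" means in the data.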
 
Hi Mads,

What do you mean by a "duplicate"? Do you mean it's like you caught a fish, threw it back into the lake, and then caught the same fish again? Or is it like catching another fish of the same species? And to pursue the fishing analogy further, do you return the fish to the lake ("sampling with replacement"), or do you keep it ("sampling without replacement")?
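
The distinction matters for the arithmetic. If the sampling is with replacement and every individual is equally likely on each draw (both assumptions, not confirmed in the thread), the expected number of repeat draws has a closed form: each individual is missed by all s draws with probability (1 - 1/p)^s, so the expected number of distinct individuals is p(1 - (1 - 1/p)^s), and the expected number of duplicates is s minus that. Sampling without replacement gives no repeated individuals at all. A minimal Python sketch, with a hypothetical function name:

```python
import math


def expected_duplicates(p: int, s: int) -> float:
    """Expected number of draws that repeat an already-drawn individual.

    Assumes s uniform draws WITH replacement from a population of p
    distinct individuals (an assumption about the sampling scheme).
    """
    if p < 2 or s < 0:
        raise ValueError("need p >= 2 and s >= 0")
    # log of (1 - 1/p)**s, the chance a given individual is never drawn;
    # log1p/expm1 keep precision when 1/p is tiny.
    log_miss = s * math.log1p(-1.0 / p)
    expected_distinct = -p * math.expm1(log_miss)  # p * (1 - (1 - 1/p)**s)
    return s - expected_distinct


# The two sample sizes mentioned in the question, with p = 3 million.
for s in (300_000, 3_000_000):
    d = expected_duplicates(3_000_000, s)
    print(f"s = {s:,}: expected duplicates = {d:,.0f} ({d / s:.1%} of draws)")
```

When s equals p, the expected duplicate count is close to p/e, or roughly 37% of the draws; for s much smaller than p it is much smaller, which matches the intuition in the original question.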
 
