How Many DNA Molecules to Sample for Sufficient Unique Sequences?

SUMMARY

The discussion centers on calculating the number of DNA molecules needed to confidently obtain a specified number of unique sequences from a total of 2x10^12 unique DNA sequences, with an average of 47 copies each. The calculation involves understanding the coupon collector's problem and applying statistical techniques such as Chebyshev's inequality to estimate the minimum sample size required for achieving at least 10^10, 10^11, or 10^12 unique sequences. The variability in the number of copies per sequence significantly affects the outcome, necessitating a careful approach to modeling the sampling process.

PREREQUISITES
  • Understanding of the coupon collector's problem
  • Familiarity with Chebyshev's inequality
  • Basic knowledge of probability distributions
  • Experience with statistical modeling techniques
NEXT STEPS
  • Study the coupon collector's problem in depth
  • Learn about Chebyshev's inequality and its applications
  • Explore probability distributions relevant to sampling
  • Investigate statistical modeling techniques for estimating sample sizes
USEFUL FOR

This discussion is beneficial for geneticists, statisticians, and researchers involved in DNA sequencing and sampling methodologies, particularly those aiming to optimize the collection of unique genetic sequences.

1eray
I have 2x10^12 unique sequences of DNA, and I have an average of 47 copies of each sequence (so 94x10^12 DNA molecules total).
How many molecules do I need to choose at random to be "confident" (defined as you please) that I have at least 10^10 unique sequences? 10^11? 10^12?
I would really like to know how to do this calculation.
Any help would be very appreciated.
Thanks,
Ed
 
If the number of copies of each sequence is not exactly 47, then the answer could vary wildly (consider the extreme case with 1 copy of every sequence but one, and copies of that single sequence making up the rest).

A first step would be to show that the number required lies somewhere between the number required when there is exactly 1 copy of each sequence (trivial) and the number required when there are infinitely many copies of each (which is a version of the coupon collector's problem).
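
To put rough numbers on the two bracketing cases, here is a sketch of the expected count in each. It assumes uniform random sampling and that every sequence is equally abundant; the symbols N, k, n, D_n and n_k are my own notation, not the original poster's. Note that n_k is only where the mean reaches the target, not a confidence guarantee.

```latex
% N = 2 \times 10^{12} distinct sequences; D_n = number of distinct sequences among n sampled molecules.
% Case 1, one copy of each sequence: every molecule drawn is new, so D_n = n and n = k suffices.
% Case 2, infinitely many copies of each (uniform draws with replacement):
\begin{align*}
\mathbb{E}[D_n] &= N\left[1 - \left(1 - \tfrac{1}{N}\right)^{n}\right] \approx N\left(1 - e^{-n/N}\right), \\
n_k &\approx -N \ln\!\left(1 - \tfrac{k}{N}\right) \quad \text{(solving } \mathbb{E}[D_n] = k \text{ in the approximation)}, \\
k = 10^{10}: \quad n_k &\approx 1.0025 \times 10^{10}, \\
k = 10^{11}: \quad n_k &\approx 1.0259 \times 10^{11}, \\
k = 10^{12}: \quad n_k &= N \ln 2 \approx 1.386 \times 10^{12}.
\end{align*}
```

So for small targets the two cases nearly coincide, and the gap only opens up as k becomes a sizeable fraction of N.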

To solve the latter problem you'd use the same techniques as for the coupon collector's problem (CCP) but truncate the sums appropriately. In effect you're modelling how the number of distinct sequences found increases (randomly) as you add one more molecule to the sample. Then, for example, apply Chebyshev's inequality using the first and second moments of that distribution as a function of the sample size, which would give you a very conservative estimate of the minimum number required.
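
As an illustration of the Chebyshev step, here is a minimal Python sketch for the infinitely-many-copies case only (uniform draws with replacement). The function names and the default `delta` are hypothetical choices of mine. Instead of the exact second moment it uses the upper bound Var[D] <= N*q*(1-q), where q is the chance a given sequence is missed; this holds because the "seen" indicators are negatively correlated, and it avoids the floating-point cancellation the exact formula suffers at N around 10^12. The result is a conservative sample size, not a sharp one.

```python
import math


def expected_distinct(n_types, n_draws):
    """E[D]: expected distinct types after n_draws uniform draws with replacement."""
    # log of P(a given type is never drawn) = n * log(1 - 1/N), computed stably
    log_miss = n_draws * math.log1p(-1.0 / n_types)
    return -n_types * math.expm1(log_miss)  # N * (1 - miss)


def variance_bound(n_types, n_draws):
    """Upper bound on Var[D]: N * q * (1 - q), with q = P(a given type is missed)."""
    log_miss = n_draws * math.log1p(-1.0 / n_types)
    return n_types * math.exp(log_miss) * (-math.expm1(log_miss))


def failure_bound(n_types, n_draws, target):
    """Chebyshev upper bound on P(D < target); returns 1.0 when uninformative."""
    gap = expected_distinct(n_types, n_draws) - target
    if gap <= 0:
        return 1.0
    return min(1.0, variance_bound(n_types, n_draws) / gap ** 2)


def min_draws(n_types, target, delta=0.01):
    """Smallest n whose Chebyshev failure bound is at most delta."""
    if not (0 < target < n_types) or not (0.0 < delta < 1.0):
        raise ValueError("need 0 < target < n_types and 0 < delta < 1")
    # At n = target the mean cannot exceed the target, so the bound there is 1.0.
    lo, hi = target, 2 * target
    while failure_bound(n_types, hi, target) > delta:
        lo, hi = hi, 2 * hi
    # The bound decreases with n once the mean exceeds the target, so bisect.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if failure_bound(n_types, mid, target) <= delta:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    N = 2 * 10 ** 12  # number of distinct sequences from the question
    for k in (10 ** 10, 10 ** 11, 10 ** 12):
        print(k, min_draws(N, k, delta=0.01))
```

With exactly 47 copies of each sequence and sampling without replacement, the miss probability would be hypergeometric rather than (1 - 1/N)^n, so treat these numbers as the upper end of the bracket described above.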
 
