How Many DNA Molecules to Sample for Sufficient Unique Sequences?

AI Thread Summary
To estimate how many DNA molecules must be sampled to obtain at least 10^10, 10^11, or 10^12 unique sequences from a pool of 2x10^12 unique sequences with an average of 47 copies each, one approach is to treat the question as a variant of the coupon collector's problem. The required sample size lies between the case of exactly one copy of each sequence and the case of infinitely many copies. Modeling how the number of distinct sequences grows as each molecule is added gives the first and second moments of that count, and Chebyshev's inequality then yields a conservative estimate of the minimum sample size. If the copy numbers are far from uniform, the answer can change drastically.
1eray
I have 2x10^12 unique sequences of DNA, and I have an average of 47 copies of each sequence (so 94x10^12 DNA molecules total).
How many molecules do I need to choose at random to be "confident" (defined as you please) that I have at least 10^10 unique sequences? 10^11? 10^12?
I would really like to know how to do this calculation.
Any help would be very appreciated.
Thanks,
Ed
 
If the number of copies of each sequence is not exactly 47, then the answer could vary wildly (consider the extreme case with 1 copy of every sequence but one, and copies of that single sequence making up the rest of the pool).

First step would be to show that the number required is somewhere between the number required when there is exactly 1 copy of each sequence (trivial, since every molecule drawn is then a new sequence) and the number required when there are infinitely many copies of each (which is a version of the coupon collector's problem).
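To make the bracketing concrete, here is a rough Python sketch. It assumes every sequence has exactly the same number of copies and that molecules are drawn without replacement; the name `expected_distinct` and its parameters are made up for illustration. A given sequence is missed entirely with probability C(M-c, n)/C(M, n), where M = N*c is the pool size, and that ratio reduces to a product of c factors (M-n-j)/(M-j). Setting c = 1 gives the trivial case, and the with-replacement formula N*(1 - (1 - 1/N)^n) gives the infinitely-many-copies limit.

```python
import math

def expected_distinct(n_draws, n_types, copies):
    """Expected number of distinct sequences among n_draws molecules drawn
    without replacement, ASSUMING exactly `copies` molecules per sequence."""
    total = n_types * copies
    if n_draws > total - copies:
        # Fewer than `copies` molecules are left behind, so no sequence is missed.
        return float(n_types)
    # log P(a given sequence is missed) = sum over j of log(1 - n/(M - j))
    log_miss = sum(math.log1p(-n_draws / (total - j)) for j in range(copies))
    return -n_types * math.expm1(log_miss)  # N * (1 - P(miss))

N = 2 * 10**12
for n in (10**10, 10**11, 10**12):
    one_copy = expected_distinct(n, N, 1)    # every draw is a new sequence
    c47 = expected_distinct(n, N, 47)        # the pool in the question
    # Infinitely many copies: same as drawing with replacement.
    unlimited = -N * math.expm1(n * math.log1p(-1 / N))
    print(f"n={n:.0e}: 1 copy {one_copy:.6e}, 47 copies {c47:.6e}, "
          f"unlimited {unlimited:.6e}")
```

For a fixed number of draws the expected count of distinct sequences falls as the copy number rises, so the 47-copy value sits between the other two. This only shows expectations; it says nothing yet about how confident you can be.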

To solve the latter problem you'd use the same techniques as for the coupon collector's problem (CCP), but truncate the sums at the target number of distinct sequences instead of running them over every sequence. In effect you're modelling how the number of distinct sequences found increases (randomly) as you add one more molecule to the sample. Then, for example, apply Chebyshev's inequality to the 1st and 2nd moments of the distribution as a function of the sample size, which would give you a very conservative estimate of the minimum number required.
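The Chebyshev step for the infinitely-many-copies case can be sketched as below. This is one way to do it, under stated assumptions: draws are with replacement and all N sequences are equally likely. The number of draws T needed to see k distinct sequences is then a sum of independent geometric waiting times with success probabilities (N-i)/N for i = 0..k-1, so E[T] = sum N/(N-i) and Var[T] = sum N*i/(N-i)^2. For N around 10^12 the sums are replaced by integrals, which slightly overestimate both and so stay on the conservative side. The function name `draws_for_distinct` and the parameter `delta` (the allowed failure probability) are illustrative choices, not from the thread.

```python
import math

def draws_for_distinct(k, n_types, delta=0.01):
    """Chebyshev-style number of draws so that, ASSUMING uniform sampling with
    replacement, P(at least k distinct sequences seen) >= 1 - delta."""
    if not 0 < k < n_types:
        raise ValueError("need 0 < k < n_types")
    if not 0 < delta < 1:
        raise ValueError("need 0 < delta < 1")
    x = k / n_types
    log_term = -math.log1p(-x)                 # ln(N / (N - k))
    mean = n_types * log_term                  # ~ sum_{i<k} N / (N - i)
    var = n_types * (x / (1 - x) - log_term)   # ~ sum_{i<k} N*i / (N - i)^2
    # Chebyshev: P(|T - mean| >= sd / sqrt(delta)) <= delta
    return math.ceil(mean + math.sqrt(var / delta))

N = 2 * 10**12
for k in (10**10, 10**11, 10**12):
    print(f"k={k:.0e}: about {draws_for_distinct(k, N):.6e} draws")
```

If the 47 copies per sequence are exact, the infinite-copies case is the harder of the two bounds, so this number should also suffice for the real pool. With strongly unequal copy numbers that no longer holds.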
 