# Probability of drawing duplicates (warning: long post)

0. Introduction
This is not actually homework. This is from real-life work in life science. I have tried to break it down into basic maths, though it will require an extension once I figure out the basics.

The following part here will serve as an introduction to the subject. Feel free to skip over it if you don't feel like reading a wall of text.

As said, I work in the Life Sciences. Next Generation Sequencing (NGS), to be more precise. We start with genomic DNA (~3.3 billion base pairs (bp)), which we randomly shear into fragments of ~150-200 bp. The idea here is that each of the resulting fragments is unique.
After some biochemical steps (which I won't tire you with) we perform some cycles of polymerase chain reaction (PCR). You don't have to grasp the process underlying PCR; suffice it to say that it makes duplicates of the original fragments. All samples undergo the same number of PCR cycles, resulting (presumably) in the same ratio of duplication.

All fragments are in the same pool. We then perform NGS on that pool. NGS is very sensitive to overloading with DNA, so we strive to always put in the same total number of fragments, by varying the fraction of the total pool that we load (e.g. if the pool has a high DNA concentration we put in a small fraction, and vice versa).

Again, NGS does not need to be understood. All you need to know is that we can derive the sequence of the bases (G, A, T and C) from all the fragments that are in the pool. These fragment sequences are then mapped to the human genome. The ones that map to a unique position of the genome (virtually all of them) are called 'reads'.

Now, some fraction of the reads that we obtain will be unique, and the remaining fraction will be a duplicate of those unique reads. Obviously, the maximum level of duplication is obtained when you put in the whole pool, at which point you get the duplication rate as it is present in the original pool. The smaller the fraction of the pool that you put into the machine, the lower your duplication rate will be. Plotting our data, I have found that the relation between duplication rate and pool fraction is eerily close to logarithmic (##R^2>0.98##). I strive to understand why this is.

I have tried to break it down to simpler examples first. This is what I will first be asking help on, before expanding it to the real life situation.

1. Homework Statement

Ultimate problem:
From a pool of N fragments (or marbles, if you will), with U unique identities (the remaining N-U are duplicates of these), draw K of them.
What is the relation between:
- the duplication rate in the pool (##\frac{N-U}{N}##) and the fraction of the pool drawn (##\frac{K}{N}##)
versus
- the duplication rate found in K (described by ##\frac{K-U_K}{K}##)?

Simplification #1:
Let's assume a pool with N=10 and U=5. Furthermore, each unique has exactly one duplicate. E.g. 10 marbles with 5 different colours, each of which is present twice (two yellow, two blue, two green, two red, two purple).

What's the relation between the real duplication rate, % of pool drawn and the found duplication rate?

Simplification #2:
Assume a pool with N=15 and U=5. This time each unique has exactly two duplicates (three yellow, three blue, etc).

What's the relation between the real duplication rate, % of pool drawn and the found duplication rate?

2. Homework Equations

Basic probability statistics.

3. The Attempt at a Solution

Simplification #1:
The real duplication rate (in the pool) ##\frac{N-U}{N}=\frac{10-5}{10}=\frac{1}{2}##.

Number of duplicates
1) If I draw one marble, I have no duplicates.
2) If I draw a second marble, there's probability 1/9 of drawing the same colour as the first, and probability 8/9 of drawing a second unique colour.
3a) If the second marble was a duplicate, there's probability 8/8 of drawing a second unique colour.
3b) If the second marble was unique, there's probability 2/8 of drawing a duplicate to either colour, and probability 6/8 of drawing a third unique colour.
.. etc.

Duplication rate
For the duplication rate, I simply multiply the probability of finding D duplicates (##D_K=K-U_K##) by the number D, sum over all D, and divide by K.

I found that the duplication rate in this case can be described by ##\frac{K-1}{2(N-1)}##, i.e. it is linear in K. But why?
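This can be checked without any curve fitting. Here is a small sketch (in Python, not part of the original working) that enumerates every equally likely draw of K marbles from the pairs pool exactly and confirms the linear formula:

```python
from fractions import Fraction
from itertools import combinations

# Pool for simplification #1: 5 colours, each present twice (N = 10, U = 5).
pool = [colour for colour in range(5) for _ in range(2)]
N = len(pool)

# Exact expected duplication rate for each draw size K, by exhaustive
# enumeration of all C(N, K) equally likely draws.
rates = {}
for K in range(1, N + 1):
    draws = list(combinations(range(N), K))
    # Duplicates in one draw = K minus the number of distinct colours drawn.
    dup_total = sum(K - len({pool[i] for i in draw}) for draw in draws)
    rates[K] = Fraction(dup_total, K * len(draws))
    # Matches the conjectured linear formula (K-1)/(2(N-1)) exactly.
    assert rates[K] == Fraction(K - 1, 2 * (N - 1))
```

Because the enumeration uses exact rational arithmetic, the agreement is exact for every K, not just a good fit.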

Simplification #2:
The real duplication rate (in the pool) ##\frac{N-U}{N}=\frac{15-5}{15}=\frac{2}{3}##.

Number of duplicates
1) One marble, no duplicates
2) Second marble, probability 2/14 of drawing first duplicate, 12/14 for second unique colour.
3a) If second was a duplicate, probability 1/13 of drawing another duplicate, 12/13 for second unique colour.
3b) If second was unique, probability 4/13 of drawing first duplicate, 9/13 for third unique.
... etc.

Duplication rate
Calculation same as before.
I found that the duplication rate in this case fits a second-order polynomial (##R^2=1##).

Similarly, if I use 4 marbles of each colour, it fits (##R^2=1##) a third-order polynomial. My problem is that I don't see why this must be so.
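The ##R^2=1## fits can also be verified exactly rather than via trendlines. The sketch below (Python, illustrative only) enumerates all draws from the triplet pool and checks that the duplication rate is exactly a quadratic in K, by confirming its third finite differences vanish:

```python
from fractions import Fraction
from itertools import combinations

# Pool for simplification #2: 5 colours, each present three times (N = 15).
pool = [colour for colour in range(5) for _ in range(3)]
N = len(pool)

# Exact expected duplication rate for K = 1 .. N.
rates = []
for K in range(1, N + 1):
    draws = list(combinations(range(N), K))
    dup_total = sum(K - len({pool[i] for i in draw}) for draw in draws)
    rates.append(Fraction(dup_total, K * len(draws)))

# A quadratic in K has vanishing third finite differences, so the
# second-order polynomial fit is exact, not merely R^2 = 1 to rounding.
diffs = rates
for _ in range(3):
    diffs = [b - a for a, b in zip(diffs, diffs[1:])]
assert all(d == 0 for d in diffs)
```

The same check with quadruplet pools and fourth differences confirms the third-order polynomial observation.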

4. What I want
I'd first like to understand the found relationships for the first and second simplifications. Why are they 1st-, 2nd- and 3rd-order polynomials? Does this relation extend (e.g. if I have 1000 marbles of each colour, will it resemble a 999th-order polynomial)?
From there, I would like to know why the data I found seems to fit to a logarithmic trendline.

Thanks in advance for any help provided!
Feel free to ask additional questions if any of the above is unclear.

mfb
Mentor
With K elements drawn out of N/2 pairs, you have K(K-1)/2 possible combinations that can be the same color with a probability of 1/(N-1) each. For the expectation value, correlations between those probabilities do not matter, so you expect K(K-1)/(2(N-1)) pairs. To get the rate of duplicates, you divide that by K and get the formula you posted.

With triplets, you also have to consider sets of 3 that can have the same color, so you get terms like K(K-1)(K-2), divided by K you get a parabola.

What is a typical number of fragments in your sample? If it is large, it is possible to simplify the problem a lot. Also, do you have some rough estimate about the variation of that number? That could influence the distribution.

haruspex
Homework Helper
Gold Member
I find it a bit easier to think in terms of the number of unique fragments found in K selections.
Assume N is sufficiently large that, in effect, this is selection with replacement. I.e., even after picking several fragments of one particular species, you still have a 1/U probability of picking it again in the next selection.
Let P(K, S) be the probability of picking exactly S distinct species in K selections. For K > 0, we can write down a recurrence relation. This expresses P(K, S) in terms of P(K-1, S) and P(K-1, S-1) when K >= S > 0. (Can you do that?)
Fortunately, you don't need to solve this. You only care about the expected value, ##E_K = \Sigma _{S=0}^K S P(K, S)##. Applying that summation to the recurrence relation, the right hand side simplifies to produce a very easy recurrence relation in ##E_K##. Your 'duplication rate' is ##1-E_K/U##. To a first approximation this is a negative exponential function of K/U.
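For concreteness, the with-replacement model can be checked numerically. The sketch below (Python, with an arbitrary illustrative U) iterates the distribution of the number of distinct species exactly and confirms that ##1-E_K/U## stays close to ##e^{-K/U}##:

```python
import math
from fractions import Fraction

U = 50      # illustrative number of distinct species, with-replacement model
Kmax = 30

# P[S] = probability of exactly S distinct species after K selections.
P = {0: Fraction(1)}
E = []
for K in range(1, Kmax + 1):
    newP = {}
    for S, p in P.items():
        # Next pick repeats one of the S species already seen.
        newP[S] = newP.get(S, Fraction(0)) + Fraction(S, U) * p
        # Next pick is a new species.
        newP[S + 1] = newP.get(S + 1, Fraction(0)) + Fraction(U - S, U) * p
    P = newP
    E.append(sum(S * p for S, p in P.items()))

# 1 - E_K/U should be close to exp(-K/U) when U >> 1.
for K, EK in enumerate(E, start=1):
    assert abs(float(1 - EK / U) - math.exp(-K / U)) < 0.01
```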

Thanks a lot for your replies mfb and haruspex!

With K elements drawn out of N/2 pairs, you have K(K-1)/2 possible combinations that can be the same color with a probability of 1/(N-1) each. For the expectation value, correlations between those probabilities do not matter, so you expect K(K-1)/(2(N-1)) pairs. To get the rate of duplicates, you divide that by K and get the formula you posted.

With triplets, you also have to consider sets of 3 that can have the same color, so you get terms like K(K-1)(K-2), divided by K you get a parabola.
Ah, this makes a lot of sense. Thanks a bunch!
This does imply that for 1000-lets I'd get a 999th order polynomial. Would this be anything close to logarithmic at some point?

What is a typical number of fragments in your sample? If it is large, it is possible to simplify the problem a lot. Also, do you have some rough estimate about the variation of that number? That could influence the distribution.
Usual in our workflow is about 1-2M fragments, of which roughly 10% are typically duplicates.
The total number of fragments varies relatively little (2- or 3-fold between the smallest and the largest), but the fraction of duplicates varies wildly: anywhere between 2 and 40% of the total number of fragments will be duplicates. As noted, this relates to the fraction of the pool that we're putting in. Good experiments (with a high molar output) will be closer to 2%, while poor experiments (with the bare minimum output) will have more duplicates.

Also keep in mind that (in contrast to the simplifications) there need not be equal numbers of duplicates for each species. It's entirely possible that one fragment has 0 duplicates while another has 1000.

I find it a bit easier to think in terms of the number of unique fragments found in K selections.
Assume N is sufficiently large that, in effect, this is selection with replacement. I.e., even after picking several fragments of one particular species, you still have a 1/U probability of picking it again in the next selection.
Let P(K, S) be the probability of picking exactly S distinct species in K selections. For K > 0, we can write down a recurrence relation. This expresses P(K, S) in terms of P(K-1, S) and P(K-1, S-1) when K >= S > 0. (Can you do that?)
Let's see if I'm catching your drift till here:
##P(K,S)=\frac{S}{U}*P(K-1,S)+\frac{U-(S-1)}{U}*P(K-1,S-1)##

My brain is a bit slow today. Is this anywhere near correct or what you meant?

Fortunately, you don't need to solve this. You only care about the expected value, ##E_K = \Sigma _{S=0}^K S P(K, S)##. Applying that summation to the recurrence relation, the right hand side simplifies to produce a very easy recurrence relation in ##E_K##. Your 'duplication rate' is ##1-E_K/U##. To a first approximation this is a negative exponential function of K/U.
I found that ##E_K = \Sigma _{S=0}^K S P(K, S) = \Sigma _{S=0}^{K-1} S (1+\frac{1}{S}-\frac{1}{U}) P(K-1, S)##
Again I'm not at all sure if this is correct. Probably not, seeing how you mentioned it being 'very easy'. Could you give me a pointer where I went wrong?

Thanks again in advance for any help provided!

mfb
Mentor
This does imply that for 1000-lets I'd get a 999th order polynomial. Would this be anything close to logarithmic at some point?
This polynomial has so many degrees of freedom that it is hard to tell without further analysis. For those large numbers there are better analysis methods, however.
Usual in our workflow is about 1-2M fragments, of whom roughly 10% are typically duplicates.
Those numbers are for the small part of the sample you perform NGS on?

As far as I understand it:
After PCR we have something like 10 million different DNA segments, each one appears 0 to [whatever] times in the sample and we do not know the distribution. We have some rough estimate on the total number of DNA segments.
We take a small part Q of the total pool and perform NGS on it. The total number of DNA segments in this small part should be somewhere around 1-2 million. We are interested in the number of different DNA segments among them.

Let's assume that all R different DNA segments occur exactly L times in the original sample. N=R*L, K=Q*N=Q*R*L.
You miss a DNA segment if you miss all of its copies, which happens with probability
$$P_{miss}=(1-Q)^L.$$ Therefore, we catch ##R \cdot (1-P_{miss})## different segments. The duplication rate is then given by $$D = 1- \frac{1-(1-Q)^L}{QL}$$
This should be exact, apart from the very good approximation that your sample volume is huge compared to the volume a single DNA segment occupies. For some choices of L and ranges of Q it does not look too different from a logarithm, but I think the following approximations are better.

Your small duplication rate, and also the comparison of 1-2 million to (L*10 million), suggest QL << 1. That motivates the substitution Q=c/L where c is small.
Plugging it in gives
$$D = 1- \frac{1-(1-\frac{c}{L})^L}{c}$$
In the limit of large L (!), ##(1-\frac{c}{L})^L \approx e^{-c}##
Therefore,
$$D = 1- \frac{1-(1-\frac{c}{L})^L}{c} \approx 1- \frac{1-e^{-c}}{c}$$
Now using that c is small, ##e^{-c} \approx 1-c+\frac{c^2}{2}-\frac{c^3}{6}##
$$D \approx \frac{c}{2} - \frac{c^2}{6}$$

Careful: The approximations require both QL << 1 and L>>1.
The assumption that L>>1 gives about 5% error at L=20 and small QL.

If L is not so large, the approximation ##1-(1-\frac{c}{L})^L \approx (1-e^{-c})\cdot (1-\frac{1}{L})## is better. On the other hand, the exact formula is not so complicated...
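As a quick numerical sanity check (a sketch with illustrative values of Q and L, not from the thread's data), the two approximations can be compared against the exact formula:

```python
import math

def dup_rate_exact(Q, L):
    # Exact duplication rate when each species occurs exactly L times
    # and a fraction Q of the pool is drawn: D = 1 - (1-(1-Q)^L)/(QL).
    return 1 - (1 - (1 - Q) ** L) / (Q * L)

L = 1000                 # copies per species: L >> 1
c = 0.2                  # c = Q * L << 1
Q = c / L

exact = dup_rate_exact(Q, L)
approx_exp = 1 - (1 - math.exp(-c)) / c       # large-L limit
approx_poly = c / 2 - c ** 2 / 6              # small-c expansion

assert abs(exact - approx_exp) < 1e-3
assert abs(exact - approx_poly) < 1e-3
# Sanity check: with L = 1 there are no duplicates in the pool at all.
assert abs(dup_rate_exact(0.3, 1)) < 1e-12
```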

If L is different for different strands, it is possible to perform that analysis for each value of L separately and make a weighted average afterwards. In general, a broader distribution with the same mean L will increase the rate of duplicates as more frequent strands generate duplicates faster than linear in their multiplicity.

haruspex
Homework Helper
Gold Member
Let's assume that all R different DNA segments occur exactly L times in the original sample. N=R*L, K=Q*N=Q*R*L.
You miss a DNA segment if you miss all its replications, this happens with a probability of
$$P_{miss}=(1-Q)^L$$. Therefore, we catch ##R \cdot (1-P_{miss})## different segments.
But the probabilities of missing different segments are not independent. If we miss one there is a reduced chance of missing others.

haruspex
Homework Helper
Gold Member
Let's see if I'm catching your drift till here:
##P(K,S)=\frac{S}{U}*P(K-1,S)+\frac{U-(S-1)}{U}*P(K-1,S-1)##

My brain is a bit slow today. Is this anywhere near correct or what you meant?
Yes, that's it.
I found that ##E_K = \Sigma _{S=0}^K S P(K, S) = \Sigma _{S=0}^{K-1} S (1+\frac{1}{S}-\frac{1}{U}) P(K-1, S)##
Again I'm not at all sure if this is correct. Probably not, seeing how you mentioned it being 'very easy'. Could you give me a pointer where I went wrong?
I said it's easy after simplification. But what you have on the right there looks very wrong to me. Please post your working.

mfb
Mentor
But the probabilities of missing different segments are not independent. If we miss one there is a reduced chance of missing others.
This is the tiny approximation I mention later. We draw 1-2 million sequences, with even more water in between (so we actually draw much more often and miss most of the time). That effect is completely negligible.