DaanV

**0. Introduction**

This is not actually homework; it comes from real-life work in the life sciences. I have tried to break it down into basic maths, though it will need extending once I figure out the basics.

The following part serves as an introduction to the subject. Feel free to skip it if you don't feel like reading a wall of text.

As said, I work in the Life Sciences. Next Generation Sequencing (NGS), to be more precise. We start with genomic DNA (~3.3 billion base pairs (bp)) which we randomly shear into fragments of ~150-200 bp. The idea here is that each of the resulting fragments is **unique**.

After some biochemical steps (which I won't tire you with) we perform some cycles of polymerase chain reaction (PCR). You don't have to grasp the process underlying PCR; suffice it to say that it makes **duplicates** of the original fragments. All samples undergo the same number of PCR cycles, resulting (presumably) in the same ratio of duplication.

All fragments are in the same pool. We then perform NGS on that pool. NGS is very sensitive to overloading with DNA, so we strive to always put in **the same total number of fragments** by varying the fraction of the total pool that we put in (e.g. if the pool has a high DNA concentration we put in a small fraction of it, and vice versa).

Again, NGS does not need to be understood. All you need to know is that we can derive the sequence of the bases (G, A, T and C) from all the fragments that are in the pool. These fragment sequences are then mapped to the human genome. The ones that map to a unique position of the genome (virtually all of them) are called 'reads'.

Now, some fraction of the reads that we obtain will be **unique**, and the remaining fraction will be **duplicates** of those unique reads. Obviously, the maximum level of duplication is obtained when you put in the whole pool, at which point you get the duplication rate as it is present in the original pool. The smaller the fraction of the pool that you put into the machine, the lower your duplication rate will be. Plotting our data, I have found that the relation between duplication rate and pool fraction is eerily close to logarithmic (##R^2>0.98##). I would like to understand why this is.

I have tried to break it down into simpler examples first. These are what I am asking for help with first, before expanding to the real-life situation.

**1. Homework Statement**

__Ultimate problem:__

From a pool of N fragments (or marbles, if you will) with U unique identities (the remaining N-U are duplicates of these), draw K fragments without replacement.

What is the relation between:

- the duplication rate in the pool (##\frac{N-U}{N}##) and the fraction of the pool drawn (##\frac{K}{N}##)

versus

- the duplication rate found in the K drawn fragments (described by ##\frac{K-U_K}{K}##, where ##U_K## is the number of unique identities among them)?
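To make the question concrete, the drawing process can be simulated. The following is only a minimal sketch under my own assumptions: fragments are drawn uniformly at random without replacement, and the pool is described by a list giving the number of copies of each unique identity. The names `simulated_duplication_rate` and `copies_per_identity` are made up for this illustration.

```python
import random


def simulated_duplication_rate(copies_per_identity, k, trials=10_000, seed=0):
    """Estimate the mean duplication rate (K - U_K) / K for draws of size k.

    copies_per_identity: number of copies of each unique identity in the pool
    (hypothetical input format; its length is U and its sum is N).
    """
    rng = random.Random(seed)
    # Build the pool as a list of identity labels, one entry per fragment.
    pool = [identity
            for identity, copies in enumerate(copies_per_identity)
            for _ in range(copies)]
    total = 0.0
    for _ in range(trials):
        drawn = rng.sample(pool, k)        # k fragments, without replacement
        unique_in_draw = len(set(drawn))   # U_K for this draw
        total += (k - unique_in_draw) / k  # duplication rate of this draw
    return total / trials


# Example: 5 identities with 2 copies each (N=10, U=5), drawing K=4.
print(simulated_duplication_rate([2] * 5, 4))
```

Drawing the whole pool (K=N) should reproduce the pool's own duplication rate, which is a quick check that the simulation is set up sensibly.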

__Simplification #1:__

Let's assume a pool with N=10 and U=5 in which each unique identity has exactly one duplicate, e.g. 10 marbles in 5 different colours, each of which is present twice (two yellow, two blue, two green, two red, two purple).

What's the relation between the real duplication rate, % of pool drawn and the found duplication rate?

__Simplification #2:__

Assume a pool with N=15 and U=5. This time each unique identity has exactly two duplicates (three yellow, three blue, etc.).

What's the relation between the real duplication rate, % of pool drawn and the found duplication rate?

**2. Homework Equations**

Basic probability and statistics.

**3. The Attempt at a Solution**

__Simplification #1:__

The real duplication rate (in the pool) ##\frac{N-U}{N}=\frac{10-5}{10}=\frac{1}{2}##.

*Number of duplicates*

1) If I draw one marble, I have no duplicates.

2) If I draw a second marble, there's probability 1/9 of drawing the same colour as the first, and probability 8/9 of drawing a second unique colour.

3a) If the second marble was a duplicate, there's probability 8/8 of drawing a second unique colour.

3b) If the second marble was unique, there's probability 2/8 of drawing a duplicate to either colour, and probability 6/8 of drawing a third unique colour.

.. etc.
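The step-by-step reasoning above can be automated. This is a sketch rather than a definitive implementation, and it assumes every colour has the same number of marbles. After some marbles have been drawn with d duplicates among them, the number of colours seen is (drawn - d), so the chance that the next marble has a new colour is (number of unseen colours × copies per colour) / (marbles left). The function name `duplicate_distribution` and its parameters are my own.

```python
from fractions import Fraction


def duplicate_distribution(n_colours, copies, k):
    """Exact distribution {d: P(D_K = d)} of the number of duplicates
    after k draws without replacement from a pool with n_colours colours
    and `copies` marbles of each colour (assumed equal for all colours)."""
    n_total = n_colours * copies
    dist = {0: Fraction(1)}            # before any draw: zero duplicates
    for drawn in range(k):             # `drawn` marbles are already out
        nxt = {}
        for d, p in dist.items():
            seen = drawn - d                            # colours seen so far
            new_marbles = (n_colours - seen) * copies   # marbles of unseen colours
            p_new = Fraction(new_marbles, n_total - drawn)
            if p_new > 0:              # next marble is a new colour
                nxt[d] = nxt.get(d, 0) + p * p_new
            if p_new < 1:              # next marble duplicates a seen colour
                nxt[d + 1] = nxt.get(d + 1, 0) + p * (1 - p_new)
        dist = nxt
    return dist


# Simplification #1 after three draws.
print(duplicate_distribution(5, 2, 3))
```

For N=10, U=5 and K=2 this reproduces the probabilities 8/9 (no duplicate) and 1/9 (one duplicate) from step 2 above.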

*Duplication rate*

For the duplication rate, I simply multiply the probability of finding D duplicates (##D_K=K-U_K##) by the number D, sum over all D and divide by K. In other words, I take the expected number of duplicates divided by K.

I found that the duplication rate in this case can be described by ##\frac{K-1}{2(N-1)}##, i.e. it is linear in K. But why?
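The formula can be cross-checked without the step-by-step enumeration. The sketch below rests on an assumption that is not part of my calculation above: by linearity of expectation, the expected number of colours in the draw is U times the probability that a given colour appears at least once, and the probability that a given colour is absent is ##\binom{N-c}{K}/\binom{N}{K}## for c copies per colour. For c=2 that ratio is ##\frac{(N-K)(N-K-1)}{N(N-1)}##, which is quadratic in K, so dividing the expected number of duplicates by K leaves something linear. The function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def expected_duplication_rate(n_colours, copies, k):
    """Expected (K - U_K) / K for k draws without replacement, assuming
    every colour has the same number of copies."""
    n_total = n_colours * copies
    # P(a given colour is absent from the draw) = C(N - copies, k) / C(N, k)
    p_absent = Fraction(comb(n_total - copies, k), comb(n_total, k))
    expected_unique = n_colours * (1 - p_absent)   # E[U_K]
    return (k - expected_unique) / k


# Compare with (K - 1) / (2 (N - 1)) for Simplification #1 (N=10, U=5).
matches = all(
    expected_duplication_rate(5, 2, k) == Fraction(k - 1, 2 * (10 - 1))
    for k in range(1, 11)
)
print(matches)  # True if the linear formula holds for every K from 1 to 10
```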

__Simplification #2:__

The real duplication rate (in the pool) ##\frac{N-U}{N}=\frac{15-5}{15}=\frac{2}{3}##.

*Number of duplicates*

1) One marble, no duplicates

2) Second marble, probability 2/14 of drawing first duplicate, 12/14 for second unique colour.

3a) If second was a duplicate, probability 1/13 of drawing another duplicate, 12/13 for second unique colour.

3b) If second was unique, probability 4/13 of drawing first duplicate, 9/13 for third unique.

... etc.

*Duplication rate*

Calculation same as before.

I found that the duplication rate in this case fits a second order polynomial (##R^2=1##).

Similarly, if I use four marbles of each colour, it fits a third order polynomial (##R^2=1##). My problem is that I don't see why this must be so.
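One way to test the polynomial order exactly, rather than through a trendline fit, is to take repeated finite differences of the expected rate over K=1..N: the values of a polynomial of degree d at equally spaced points have differences that become all zero after d+1 rounds. The sketch below is an illustration with made-up function names. It assumes the expected number of unique colours is ##U\left(1-\binom{N-c}{K}/\binom{N}{K}\right)## for c copies per colour, and it uses exact fractions so that "zero" really means zero.

```python
from fractions import Fraction
from math import comb


def expected_rate(n_colours, copies, k):
    """Expected duplication rate for k draws without replacement,
    assuming `copies` marbles of each of n_colours colours."""
    n_total = n_colours * copies
    p_absent = Fraction(comb(n_total - copies, k), comb(n_total, k))
    return (k - n_colours * (1 - p_absent)) / k


def polynomial_degree(values):
    """Degree of the polynomial through exact, equally spaced values,
    found by differencing until every entry is zero (-1 means all zero)."""
    degree = -1
    while any(v != 0 for v in values):
        values = [b - a for a, b in zip(values, values[1:])]
        degree += 1
    return degree


for copies in (2, 3, 4):
    n_total = 5 * copies
    rates = [expected_rate(5, copies, k) for k in range(1, n_total + 1)]
    print(copies, "copies per colour -> degree", polynomial_degree(rates))
```

If the pattern I found holds, this should report degrees 1, 2 and 3 for two, three and four copies per colour.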

**4. What I want**

I'd first like to understand the relationships found for the first and second simplifications. Why are they 1st, 2nd and 3rd order polynomials? Does this pattern extend (e.g. if I have 1000 marbles of each colour, will it resemble a 999th order polynomial)?

From there, I would like to know why my real data seem to fit a logarithmic trendline.

Thanks in advance for any help provided!

Feel free to ask additional questions if any of the above is unclear.