# Close to great derivation: dna substitution

Tags:
1. Sep 29, 2011

### stabu

Hi,

I'm really close to deriving something that many biology textbooks like to skip: the probability of a certain DNA molecule (base) remaining the same while it suffers changes both to and from the other 3 types of DNA molecules.

This sounds specificially biological, but as probably a good deal of you know, it can be applied to a ton of other situations: migrations in and out of cities, influx and outflux of liquids, growth and decay and lots more besides. DNA substitution is also nice to treat however.

However, I am stuck at a certain point, mainly that I am not getting the right answer, and I am definitely missing something, so if anybody could shine any light, especially in relation to my missteps, I'd appreciate it.

Ok, let's start. We have a DNA site which happens to be a "C", a cytosine molecule. Over time, the site changes into either of the other three DNA molecules, "A", "G", or "T". We want to calculate $P_c(t)$, which is the probability of the site being a "C" at time $t$

Now, let's characterise a rate that we can use: we will say it is instantaneous, which is somewhat disconcerting in reality, because we'd be unable to count changes in a split instant of time, but we're in the world of theory right now, so it's possible to give it a symbol $\alpha$. I already know the solution, and it makes pefect sense: it's

$P_c(t)=\frac{1}{4} +\frac{3}{4} e^{-4 \alpha t}$

It make sense, because the $P_c(t)$ will eventually settle to a steady state $\frac{1}{4}$, but in the beginning the $\frac{3}{4}$ part will decrease exponentially with time. But, I want to derive this from (fairly) first principles.

So, we're going to observe a site for a certain unit of time, and we say we'll start at $t = 0$.

So we start at the beginning with our "C". What's $P_c(t)$ at $t= 0$? Easy one:

$P_c(0)=1$

Great, let's move on to our second time unit.

$P_c(1)=1-3\alpha$

i.e. during this time, there's been "move" to the other bases, each at $\alpha$. Let's move on

$P_c(2) = (1-3 \alpha) P_c(1) + \alpha (1-P_c(1))$

Here we get the two main phenomena at work: we continue to lose chances of staying at "C" by $3 \alpha$, but the bases that now are not "C" will also suffer a rate of change back into "C" again. This second step allows us to generalise to any time point $t$ and a time advance $\triangle t$. We must replace our $\alpha$ with $\alpha \triangle t$ to get

$P_c(t + \triangle t) = (1-3 \alpha \triangle t) P_c(t) + \alpha \triangle t (1-P_c(t))$

which is

$P_c(t + \triangle t) = P_c(t) - 4 \alpha \triangle t P_c(t) + \alpha \triangle t$

and also

$\frac{P_c(t + \triangle t) - P_c(t)}{\triangle t} = \alpha - 4 \alpha P_c(t)$

So taking limits we get

$\frac{d P_c(t)}{dt} = \alpha ( 1 - 4 P_c(t))$

And we can start gearing up to do an integration as we re-arrange:

$\frac{d P_c(t)}{1 - 4 P_c(t)} = \alpha dt$

At this point, I'm going to drop the pseudo-authoritative tone, which I guess you noticed, and admit that I'm getting unsteady at this point because I'm unsure whether to go for indefinite integral or start apply some limits. However, there is time for a substitution before I make that nasty decision:

Our integral:

$\int \frac{d P_c(t)}{1 - 4 P_c(t)} = \int \alpha dt$

Let $Q = 1 -4 P_c(t)$, so that

Let $dQ = -4 dP_c(t)$

and $dP_c(t) = - \frac{dQ}{4}$

which we apply to the above integral to get

$- \frac{1}{4} \int \frac{dQ}{Q} = \int \alpha dt$

OK, time to do the dirty. I'm going to go for indefinite integrals and combine constant terms.

$\ln |Q| = - 4 \alpha t + C$

which is

$\ln |1-4 P_c(t)| = - 4 \alpha t + C$

So,

$|1-4 P_c(t)| = e^{-4 \alpha t} + C$

$4 P_c(t) = 1 - e^{-4 \alpha t} + C$

Initial condition:

$P_c(0) = 1$, so $C = 4$

And so,

$P_c(t) = \frac{1}{4} - \frac{1}{4} e^{-4 \alpha t} + 1$

Which is not the right answer, though it would be if that final 1 was also multiplied by
$e^{-4 \alpha t}$

But my suspect way of integrating has not given this.

Which leads me to my humble request for guidance on this.

Last edited: Sep 29, 2011
2. Sep 29, 2011

### stabu

Me again, whoops, while editting a post, I seem to have lost the maths notation, so I tediously repeat the whole post: let's hope it works and that I don't bore the teeth out of people:

re-rectification. No it came out properly. Thanks for reading.

3. Sep 29, 2011

### HallsofIvy

This board currently has a mildly annoying bug- if you edit a post the "LaTeX" will not automatically format. Fortunately, it is "mild"- just click on the "refresh" button on you internet reader and it will format. (Unless, of course, you have incorrect code.)

4. Sep 29, 2011

### stabu

Yep, that's true, an annoyance rather than a bug.

In relation to my derivation, it's drilling a little hole in my head, because it's only a small thing in the way of finshing it. I'm definitely scanning alot of websites, but I'm getting no joy. Each time, I have to read over the same old integration basics, in the hope that somebody will drop in a the key insight.

Cheers.

5. Sep 29, 2011

### Stephen Tashi

Explain that step.

6. Sep 29, 2011

### Bacle

I don't understand the jump from:

ln| 1- 4Pc(t)|= -4at+ C

into |1-4Pc(t)|= e^(-4at)+C ;

Shouldn't it be:

from:

ln|1-4Pc(t)|=-4at+C ; raise both sides to e:

|1-4Pc(t)|=e^C *e^(-4at) ?

(Sorry, I have put of learning yet another version of Tex)

7. Sep 30, 2011

### stabu

Oh great, some replies and interest. Thanks guys.

Stephen Tashi, I'll reply to you first, first in explanatory words, and then using the general term. $P_c(1)$ represents the probability of of the site having a "C" after the very first unit time step. The site started as "C" to begin with, so during this single unit time, the rate change has been at work, imposing a migration to the other three possible bases or symbols: "A", "G" or "T" at a rate $\alpha$, for each. You could say that each of the other three bases try to pull the original "C" towards themselves at rate $\alpha$, so you have to multiply $\alpha$ by 3 to reflect this, and it must be a subtraction, because you are taking away from the original "C".

The general term (and sticking to our unit time steps) is:

$P_c(t +1) = (1-3 \alpha) P_c(t) + \alpha (1-P_c(t))$

So if you set $t = 0$, you get:

$P_c(1) = (1-3 \alpha) P_c(0) + \alpha (1-P_c(0))$

But $P_c(0) = 1$ because we started with a C, so that means

$P_c(1)=1-3\alpha$

Bacle, you are exactly right and it's actually down to laziness on my part that I didn't say $e^C$. When using indefinite integrals in this way I use capital $C$ (not be be confused in this case with the symbol "C", cystosine) almost as a placeholder, meaning "any constant value which is independent of the changing variables". Certainly $C \neq e^C$ but I continue to call it $C$ because of its role in the problem. It's changed value, sure, and technically the equations are incorrect, but I'm not interested in how $C$ evolves. I will look at it in due course, at the end, but in the middle of the derivation, I want to pay attention to the other terms.

If anybody thinks that is bad and / or erroneous, I'd be happy to mend my ways if I heard why.

In any case, using indefinite integrals doesn't fill my heart with unremitting joy either, so in the meantime, I've tried to use definite ones. This means deciding on the starting and ending states. Now, I'm a little rusty here, but I don't want there to be an fixed end state, I want it to be a variable. So I'm going to define $\tau$ as variable in a certain point of time $t$. I hope I'm allowed get away with that. So picking up the argument again with this equation

$\frac{d P_c(t)}{1 - 4 P_c(t)} = \alpha dt$

the right hand side will be

$\int^\tau_0 \alpha dt$

which is easy and simply equals $\alpha \tau$ with no constant term.

On the left hand side, we will have

$\int^{P_c(\tau)}_1 \frac{d P_c(t)}{1 - 4 P_c(t)} = \alpha dt$

because at $t=0$, $P_c(t) = 1$. Of course, we also carried out a substitution at this point, so I need to calculate these values for Q.

Because our substitution was $Q = 1 -4 P_c(t)$, then at $t=0$, $Q=-3$, and when $t = \tau$,then $Q = 1 -4 P_c(\tau)$.

So, taking the right hand side as integrated already, I can write the integration equation as:

$- \frac{1}{4} \int^{1 -4 P_c(\tau)}_{-3} \frac{dQ}{Q}$ and integrate:

$- \frac{1}{4} \ln |Q| |^{1 -4 P_c(\tau)}_{-3}$

Now the appearance of $-3$ on the lower bound looks threatening, but I'm going to be able to lean on the idea that I can simply use the absolute value, i.e. $+ 3$ when this is set against a logarithm. In a way, this felt almost too convenient and googling up this little detail came up with very few people questioning it, but the bigger problem is the upper bound, because it's a variable. Sometimes it's positive, other times it's negative, but I also want to lose the absolute value (the norm) operation, so I get stuck here, but headstrong as ever, I continue regardless and decide to make it disappear (gosh, I feel proud of myself):

$- \frac{1}{4} ( \ln(1 -4 P_c(\tau)) -\ln(3) ) = \alpha \tau$

which can also be put as:

$- \frac{1}{4} \ln(\frac{1 -4 P_c(\tau)}{3} = \alpha \tau$

$\ln\frac{1 -4 P_c(\tau)}{3} = - 4 \alpha \tau$

and

$1 -4 P_c(\tau) = 3 e^{-4 \alpha \tau}$

So, I can say

$4 P_c(\tau) = 1- 3 e^{-4 \alpha \tau}$

and so

$P_c(\tau) = \frac{1}{4} - \frac{3}{4} e^{-4 \alpha \tau}$

which is almost the right answer except for the minus sign! So, it means I'm getting closer! In fact if, after the integration, when I got

$- \frac{1}{4} ( \ln(1 -4 P_c(\tau)) -\ln(3) )= \alpha \tau$

$- \frac{1}{4} ( \ln(4 P_c(\tau) -1) -\ln(3) )= \alpha \tau$

i.e. changing $4 P_c(\tau) -1$ to $1 - 4 P_c(\tau)$

I would get the the right answer, but that would appear to be cheating.

If anybody has any further questions or insights, I'd be very happy to hear them. Also I appreciate my posts are on the long side, so thank you for reading and try to follow it.

8. Sep 30, 2011

### Mute

You misunderstood Bacle's point. When you exponentiate both sides of your equation, the resulting right hand side should be $Ae^{-4\alpha t}$, where A = e^C. The right hand side that you wrote down was $e^{-4\alpha t} + C$, which is not the same. This is why your first attempt was incorrect.

For your second attempt, you do need to be careful when dealing with the absolute value sign, but I wouldn't call changing the order of the terms in the absolute value sign cheating: you know the absolute value has two possible results, and only one of them gives you the correct initial condition. The other gives $P_c(0) = -1/2$.

9. Oct 4, 2011

### stabu

OK, many thanks Mute, because that seems to clear it up.

Yes. I now see my error in the first attempt, and how correcting it gives the answer I wanted. That was a lame answer I gave Bacle! I need to be more careful there.

Having to use the absolute value when applying the logarithm is critical for this right answer. I'm going to try and get a more intuitive hold on it, because it doesn't feel quite right.

In amy case, I appear to have got there. Many thanks for all your help!

10. Oct 4, 2011

### Stephen Tashi

That doesn't give a mathematical explanation of why $P_C(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$ implies $P_C(1) = 1 - 3\alpha$. You are apparently using the approximation $e^{x} \approx 1 + x$ for small values of $x$.

11. Oct 4, 2011

### Mute

"Pc(1) = 1 - 3α" is really "Pc(Δt) = 1 - 3αΔt". Once the general difference equation for Pc(n) was obtained, the timestep was made explicit and taken to zero, giving the differential equation. So, "Pc(1) = 1 - 3α" is really equivalent to saying $P_c(dt) = 1/4 + 3/4e^{-\alpha t} \approx 1 - 3\alpha dt$.