# Sequence Analysis (probability)

1. Jan 31, 2015

### bowlbase

1. The problem statement, all variables and given/known data
I have a 4-letter DNA sequence (AGGA) that appears 10 times in a strand that is 1027 letters long. The probability of finding this sequence at any random position is 1/256. What is the Z-score of this observation?

2. Relevant equations
$Z=\frac{(X-E(X))}{\sqrt{Var(X)}}$

3. The attempt at a solution
First I need to find out E(X). From a strand of 1027 there will be 1024 possible 4-letter positions (I found this by seeing how many were available in a 10 letter long strand and expanded it (n-3)). Multiplying the total by the probability of finding it in any position I get 1024/256=4. So E(X)=4. The actual found was 10 so I just need to square root of the variance. $V(X)=E(x^2)-E(x)^2$. I have E(X) so I just need to find $E(X^2)$.

As before this is where I seem to struggle. I'm not given a distribution or anything. All I know is the probability of finding this sequence in a random position in my current strand: 10/1024=0.009766. But I don't think this is what I should use for finding the variance as it would give me a negative result for the z value. My other idea was to just say that for this strand the variance is 6 (because 10 is 6 from the mean of 4). Then $\sqrt{6}=2.44$ giving $\frac{10-4}{2.44}=2.44$ which is more along the lines of what I expected.

Is this second method correct or is there something else that I'm not figuring out here?

Thanks for the help.

Last edited: Jan 31, 2015
2. Jan 31, 2015

### Staff: Mentor

Let me guess: the 1/256 is the expectation value, the 10 is your observed value and you want to test this for significance?

If you have 1024 independent trials with 1/256 probability each, you get a Poisson distribution. The variance is equal to the expectation value of 4.
Your trials are not completely independent because a sequence starting at position 4 (for example) makes a sequence starting at position 5 impossible, but this effect is small (~2%).

3. Jan 31, 2015

### Stephen Tashi

Are the specific letters in the DNA sequence given?

The problem statement should include what the problem asks you to do with the given information.

4. Jan 31, 2015

### bowlbase

Half correct. THe 1/256 isn't the expectation. Rather it is the probability of finding the 4 letter sequence at any given position within the sequence.

Sorry, I kept getting interrupted as I was writing this (potty training my son). The question is asking just for the "Z-score" (standard score) of the observation. It does, in fact, give an actual sequence: AGGA. I didn't mention it as I didn't believe it was important to the calculations. There is a hint that I should assume that the occurrence of the pattern at any position is independent of it occurring anywhere else. I think this works with my 1024 maximum (AGGAGGA would count as two).

5. Jan 31, 2015

### Staff: Mentor

Well, the expectation value for each position. Okay.
Not surprising, as there are 256 possible 4-letter sequences.
Good, that is exactly the approximation I mentioned.
Then you get a Poisson distribution for the occurences of the sequence and the variance is equal to the expectation value.

6. Jan 31, 2015

### bowlbase

Okay, then using Poisson's:
$P(X)=\frac{\lambda^k e^{-\lambda}}{k!}=\frac{4^{10} e^{-4}}{10!}=0.00592$

So, the probability of 10 occurrences here is just 0.00592.How does this relate to the Z factor though? If the variance is indeed the same as the expectation then my answer is exactly "3" : $\frac{10-4}{\sqrt{4}}$ and I needn't have done the Poisson distribution.

7. Jan 31, 2015

### Staff: Mentor

Right.
You did, to find the variance.

8. Jan 31, 2015

### bowlbase

But every value I used for Poissons was already known from my earlier calculations and I didn't use the result for anything.. I feel like I missed something here.... Was I suppose to use the 0.00529 to find the expectation with 1024 strand (1024*.00529=5.42)?

9. Jan 31, 2015

### Staff: Mentor

???

If your 1027 letter DNA strand is completely random, the chance to find exactly 10 AGGA sequences in it is 0.53%. To get this value, you just need the probability to find AGGA at a specific location (1/256), the length of the strand and Poisson statistics.
You can also calculate the chance to find 10 or more (but that number is not so different from 0.53%).

10. Jan 31, 2015

### bowlbase

I'm struggling to understand the purpose of doing this when I just needed to get the Z score. I understand its significance otherwise.

11. Feb 1, 2015

### Staff: Mentor

To get the Z score, you need the variance.

12. Feb 1, 2015

### Ray Vickson

I think the problem’s proposer is asking you to assume you have 1027 (or maybe 1024) independent trials with ‘success” probability p = 1/256 in each trial, then to compute an associated z-score. In fact, if $X_n$ = the number of successes in n independent trials with success probability $p$ in each trial, the probability distribution of $X_n$ is a well-known, standard distribution that appears in every introductory probability course. The mean and variance of $X_n$ are given by simple formulas in terms of $n$ and $p$, so the z-score for the observation $\{X_n = 10 \}$ is readily computable for $p = 1/256$ and $n = 1024$ or $n = 1027$. The fact is, though, that using a z-score in this problem is inappropriate, so the person proposing the problem is really asking for a useless computation.

As has already been suggested, the proposer’s description leads to a good approximation of $X_n$ by a simpler Poisson random variable $Y_n$ with mean $m = 4$ (when $n = 1024$) or $m = 4.0117$ (when $n = 1027$). Computing things like $P(Y_n = 10), \; P(Y_n \leq 10)$ and $P(Y_n > 10)$ is straightforward enough without excessive computational effort, but trying to get at them by using a z-score would lead to significant errors.

More seriously: the very problem description proposed is off-kilter: the events “occurrence of AGGA” are far from independent, and are not even approximately so. The precise analysis involves the quite “advanced” tools of Renewal Theory and/or Markov chains. In the following I will outline how to deal with the problem, although you may not have enough background to follow the development.

Basically, you need to look at the probability distributions of the first position where final ‘A’ in AGGA occurs (which can be at positions 4,5,6,…) then the inter-occurrence distributions between the remaining occurrences (which can be at 3,4,5,… positions farther on), because you allow the final ‘A’ of one occurrence to be a potential starting ‘A’ of a new occurrence. Assuming that each letter ‘A’, ‘G’, ‘other 1’ and ‘other 2’ occurs independently with probability 1/4 at each position, we can develop a 5-state Markov chain to describe occurrences of our pattern of interest, and the 'times' $T_1$ and $T_2$ we want are the first-passage times to state 5, starting from state 1 or state 2. We can get numerical values for the distributions $f_1(k), k = 1, \ldots, N$ of $T_1$ and $f_2(k), k = 1,2, \ldots, N$ of $T_2$, where $N$ is some large integer $> 1027$; I used $N = 1051$ in my computations, but a smaller $N$ would do as well. Now, if $T’_2, \ldots, T’_{10}$ are nine statistically independent copies of $T_2$ (each with distribution $f_2$) then

$$S_{10} = T_1 + \sum_{i=2}^{10} T’_i$$

is the 'time' (= position) at which the 10th occurrence of AGGA occurs. The distribution $f_{10}$ of $S_{10}$ is the 10-fold convolution
$$f_{10} = f_1 * f_2 * f_2 * \cdots * f_2,$$
where $*$ denotes convolution and there are 9 factors $f_2$. From this, we can obtain

$$P(S_{10} \leq 1027) = \sum_{k=1}^{1027} f_{10}(k) = .0087902693$$
and
$$P(\text{Number occurrences = } 10) = P(S_{10} \leq 1027 \; \;, \;\; S_{11} > 1027 ) \\ = \sum_{j=1}^{1027} f_{10}(j) P(T_2 > 1027-j) = 0.0026318995$$
This is the probability of observing exactly 10 occurrences of AGGA in the possible ending positions 4,5,…,1027. It is quite a bit less than the value 0.00592 that is obtained from the Poisson approximation.

Last edited: Feb 1, 2015
13. Feb 1, 2015

### Staff: Mentor

How did you get that number?
We end a sequence with A, that increases the probability that the next sequence starts directly (and gives a deviation larger than the 2% I estimated earlier - which then is even more increased by the fact that we look for ~3sigma outliers). In the extreme case, something like "AAAA" would give a much higher rate because the next sequence has 1/4 probability to start right after the previous one.

I made a 50-state markov chain in excel and I get 0.009123921 for at least 10 occurences.

Other probabilities:
0: 0.017
1: 0.070
2: 0.142
3: 0.192
4: 0.195
5: 0.158
6: 0.107
7: 0.062
8: 0.031
9: 0.014

Last edited: Feb 1, 2015
14. Feb 1, 2015

### Ray Vickson

I viewed occurrences of AGGA as a delayed renewal process. The time (number of spaces) until the first occurrence is $T_0$, which has pmf $f_0 = (3/4) f_1 + (1/4) f_2$, while the times between successive later occurrences has pmf $f_2$. Here, $f_1, f_2$ are the first-passage pmfs from states 1 and 2 to state 5 in the following 5-state Markov chain.

The Markov chain.
State 1 = start, or start over; state 2 = A (first A after starting), state 3 = AG, state 4 = AGG, and state 5 = AGGA (stop).
The 1-step transition probability matrix is
$$P = \pmatrix{3/4 & 1/4 & 0 & 0 & 0\\ 1/2 & 1/4 & 1/4 & 0 & 0 \\ 1/2 & 1/4 & 0 & 1/4 & 0 \\ 3/4 & 0 & 0 & 0 & 1/4 \\ 0 & 0 & 0 & 0 & 1}$$
Let $f_i(n)$ denote the probability that $n$ is the first time to reach state 5 from state $i$. We have $f_i(1) = p_{i,5}, i = 1,2,3,4$ and
$$f_i(n+1) = \sum_{j < 5} p_{ij} f_j(n), \; i=1, \ldots, 4$$
This can be implemented numerically by letting $f(n) =$ column vector $(f_i(n), i=1,\ldots, 4)$, and by letting $Q=$ the $4 \times 4$ submatrix of $P$ consisting of rows and columns 1--4. We then have
$$f(n+1) = Q f(n), n = 1,2, \ldots$$
with $f(1) = (0,0,0,1/4)^T$. The pmf of $T_1$ is $\{ f_1(n), n = 1,2, \ldots \}$, while the pmf of $T_2$ is $\{ f_2(n), n = 1,2, \ldots \}$. We just compute these up to $n = 1230$.

To find the probability that 10 occurrences take place in the interval $t = 1, 2, \ldots, 1027$ we start by letting
$$S_{10} = T_0 + \sum_{j=1}^{9} T'_j,$$
which is the number of steps needed to reach 10 occurrences of the pattern. Here, $T_0, T'_1, T'_2, \ldots$ are independent, $T_0$ has df $f_0$ and the $T'_k$ have df $f_2$. We take $f_0 = (3/4) f_1 + (1/4) f_2$ because if the first letter is not 'A' we start the Markov chain from state 1, while if the first letter is 'A' we start from state 2. The event $\{ S_{10} \leq 1027 \}$ occurs when the 10th pattern is on or before time 1027 (so the number occurring on or before 1027 is $\geq 11$). The event $\{ \text{No. of occurrences } = 10 \} = \{ S_{10} \leq 1027 \} \cap \{ S_{11} > 1027\}$, so has probability
$$\sum_{j=1}^{1027} P(S_{10} = j) P(T_2 > 1027 - j) = \sum_{j=1}^{1027} f_{10}(j) \left(1-\sum_{k=1}^{1027-j} f_2(k) \right)$$

I just did all the computations in Maple.

Last edited: Feb 1, 2015
15. Feb 1, 2015

### Staff: Mentor

I don't see how you get the distribution of the sum of 9 Ti without folding them in some way (or making 9 sums).

Split up into "exactly 10" and "more than 10", I get 0.005874776 and 0.003249145. And I'm quite sure that result is correct (with rounding errors smaller than the difference between our results).

16. Feb 1, 2015

### Ray Vickson

I obtained the distribution of $S_{10}$ by the multiple convolution $f_{10} = f_0 * f_2 * f_2 * \cdot * f_2$, where there are 9 $f_2$ factors. I just did the convolutions numerically.