Sequence Analysis (probability)

In summary, a 4-letter DNA sequence appears 10 times in a 1027-letter strand. The probability of finding this sequence at any random position is 1/256, and the Z-score of this observation is 4.
  • #1
bowlbase
146
2

Homework Statement


I have a 4-letter DNA sequence (AGGA) that appears 10 times in a strand that is 1027 letters long. The probability of finding this sequence at any random position is 1/256. What is the Z-score of this observation?

Homework Equations


##Z=\frac{(X-E(X))}{\sqrt{Var(X)}}##

The Attempt at a Solution


First I need to find out E(X). From a strand of 1027 there will be 1024 possible 4-letter positions (I found this by seeing how many were available in a 10 letter long strand and expanded it (n-3)). Multiplying the total by the probability of finding it in any position I get 1024/256=4. So E(X)=4. The actual found was 10 so I just need to square root of the variance. ##V(X)=E(x^2)-E(x)^2##. I have E(X) so I just need to find ##E(X^2)##.

As before this is where I seem to struggle. I'm not given a distribution or anything. All I know is the probability of finding this sequence in a random position in my current strand: 10/1024=0.009766. But I don't think this is what I should use for finding the variance as it would give me a negative result for the z value. My other idea was to just say that for this strand the variance is 6 (because 10 is 6 from the mean of 4). Then ##\sqrt{6}=2.44## giving ##\frac{10-4}{2.44}=2.44## which is more along the lines of what I expected.

Is this second method correct or is there something else that I'm not figuring out here?

Thanks for the help.
 
Last edited:
Physics news on Phys.org
  • #2
bowlbase said:
The probability of finding this sequence at any random position is 1/256.
Let me guess: the 1/256 is the expectation value, the 10 is your observed value and you want to test this for significance?

If you have 1024 independent trials with 1/256 probability each, you get a Poisson distribution. The variance is equal to the expectation value of 4.
Your trials are not completely independent because a sequence starting at position 4 (for example) makes a sequence starting at position 5 impossible, but this effect is small (~2%).
 
  • #3
bowlbase said:

Homework Statement


I have a 4-letter DNA sequence that appears 10 times in a strand that is 1027 letters long. The probability of finding this sequence at any random position is 1/256.

Are the specific letters in the DNA sequence given?

The problem statement should include what the problem asks you to do with the given information.
 
  • #4
mfb said:
Let me guess: the 1/256 is the expectation value, the 10 is your observed value and you want to test this for significance?

Half correct. THe 1/256 isn't the expectation. Rather it is the probability of finding the 4 letter sequence at any given position within the sequence.

Stephen Tashi said:
Are the specific letters in the DNA sequence given?

The problem statement should include what the problem asks you to do with the given information.

Sorry, I kept getting interrupted as I was writing this (potty training my son). The question is asking just for the "Z-score" (standard score) of the observation. It does, in fact, give an actual sequence: AGGA. I didn't mention it as I didn't believe it was important to the calculations. There is a hint that I should assume that the occurrence of the pattern at any position is independent of it occurring anywhere else. I think this works with my 1024 maximum (AGGAGGA would count as two).
 
  • #5
bowlbase said:
Half correct. THe 1/256 isn't the expectation. Rather it is the probability of finding the 4 letter sequence at any given position within the sequence.
Well, the expectation value for each position. Okay.
Not surprising, as there are 256 possible 4-letter sequences.
bowlbase said:
There is a hint that I should assume that the occurrence of the pattern at any position is independent of it occurring anywhere else.
Good, that is exactly the approximation I mentioned.
Then you get a Poisson distribution for the occurences of the sequence and the variance is equal to the expectation value.
 
  • #6
Okay, then using Poisson's:
##P(X)=\frac{\lambda^k e^{-\lambda}}{k!}=\frac{4^{10} e^{-4}}{10!}=0.00592##

So, the probability of 10 occurrences here is just 0.00592.How does this relate to the Z factor though? If the variance is indeed the same as the expectation then my answer is exactly "3" : ##\frac{10-4}{\sqrt{4}}## and I needn't have done the Poisson distribution.
 
  • #7
bowlbase said:
then my answer is exactly "3"
Right.
bowlbase said:
and I needn't have done the Poisson distribution.
You did, to find the variance.
 
  • #8
mfb said:
Right.
You did, to find the variance.

But every value I used for Poissons was already known from my earlier calculations and I didn't use the result for anything.. I feel like I missed something here... Was I suppose to use the 0.00529 to find the expectation with 1024 strand (1024*.00529=5.42)?
 
  • #9
bowlbase said:
Was I suppose to use the 0.00529 to find the expectation with 1024 strand (1024*.00529=5.42)?
?

If your 1027 letter DNA strand is completely random, the chance to find exactly 10 AGGA sequences in it is 0.53%. To get this value, you just need the probability to find AGGA at a specific location (1/256), the length of the strand and Poisson statistics.
You can also calculate the chance to find 10 or more (but that number is not so different from 0.53%).
 
  • #10
I'm struggling to understand the purpose of doing this when I just needed to get the Z score. I understand its significance otherwise.
 
  • #11
To get the Z score, you need the variance.
 
  • #12
bowlbase said:

Homework Statement


I have a 4-letter DNA sequence (AGGA) that appears 10 times in a strand that is 1027 letters long. The probability of finding this sequence at any random position is 1/256. What is the Z-score of this observation?

Homework Equations


##Z=\frac{(X-E(X))}{\sqrt{Var(X)}}##

The Attempt at a Solution


First I need to find out E(X). From a strand of 1027 there will be 1024 possible 4-letter positions (I found this by seeing how many were available in a 10 letter long strand and expanded it (n-3)). Multiplying the total by the probability of finding it in any position I get 1024/256=4. So E(X)=4. The actual found was 10 so I just need to square root of the variance. ##V(X)=E(x^2)-E(x)^2##. I have E(X) so I just need to find ##E(X^2)##.

As before this is where I seem to struggle. I'm not given a distribution or anything. All I know is the probability of finding this sequence in a random position in my current strand: 10/1024=0.009766. But I don't think this is what I should use for finding the variance as it would give me a negative result for the z value. My other idea was to just say that for this strand the variance is 6 (because 10 is 6 from the mean of 4). Then ##\sqrt{6}=2.44## giving ##\frac{10-4}{2.44}=2.44## which is more along the lines of what I expected.

Is this second method correct or is there something else that I'm not figuring out here?

Thanks for the help.

bowlbase said:
I'm struggling to understand the purpose of doing this when I just needed to get the Z score. I understand its significance otherwise.

I think the problem’s proposer is asking you to assume you have 1027 (or maybe 1024) independent trials with ‘success” probability p = 1/256 in each trial, then to compute an associated z-score. In fact, if ##X_n## = the number of successes in n independent trials with success probability ##p## in each trial, the probability distribution of ##X_n## is a well-known, standard distribution that appears in every introductory probability course. The mean and variance of ##X_n## are given by simple formulas in terms of ##n## and ##p##, so the z-score for the observation ##\{X_n = 10 \}## is readily computable for ##p = 1/256## and ##n = 1024## or ##n = 1027##. The fact is, though, that using a z-score in this problem is inappropriate, so the person proposing the problem is really asking for a useless computation.As has already been suggested, the proposer’s description leads to a good approximation of ##X_n## by a simpler Poisson random variable ##Y_n## with mean ##m = 4## (when ##n = 1024##) or ##m = 4.0117## (when ##n = 1027##). Computing things like ##P(Y_n = 10), \; P(Y_n \leq 10)## and ##P(Y_n > 10)## is straightforward enough without excessive computational effort, but trying to get at them by using a z-score would lead to significant errors.More seriously: the very problem description proposed is off-kilter: the events “occurrence of AGGA” are far from independent, and are not even approximately so. The precise analysis involves the quite “advanced” tools of Renewal Theory and/or Markov chains. In the following I will outline how to deal with the problem, although you may not have enough background to follow the development.

Basically, you need to look at the probability distributions of the first position where final ‘A’ in AGGA occurs (which can be at positions 4,5,6,…) then the inter-occurrence distributions between the remaining occurrences (which can be at 3,4,5,… positions farther on), because you allow the final ‘A’ of one occurrence to be a potential starting ‘A’ of a new occurrence. Assuming that each letter ‘A’, ‘G’, ‘other 1’ and ‘other 2’ occurs independently with probability 1/4 at each position, we can develop a 5-state Markov chain to describe occurrences of our pattern of interest, and the 'times' ##T_1## and ##T_2## we want are the first-passage times to state 5, starting from state 1 or state 2. We can get numerical values for the distributions ##f_1(k), k = 1, \ldots, N## of ##T_1## and ##f_2(k), k = 1,2, \ldots, N## of ##T_2##, where ##N## is some large integer ##> 1027##; I used ##N = 1051## in my computations, but a smaller ##N## would do as well. Now, if ##T’_2, \ldots, T’_{10}## are nine statistically independent copies of ##T_2## (each with distribution ##f_2##) then

[tex]S_{10} = T_1 + \sum_{i=2}^{10} T’_i [/tex]

is the 'time' (= position) at which the 10th occurrence of AGGA occurs. The distribution ##f_{10}## of ##S_{10}## is the 10-fold convolution
[tex] f_{10} = f_1 * f_2 * f_2 * \cdots * f_2, [/tex]
where ##*## denotes convolution and there are 9 factors ##f_2##. From this, we can obtain

[tex] P(S_{10} \leq 1027) = \sum_{k=1}^{1027} f_{10}(k) = .0087902693 [/tex]
and
[tex] P(\text{Number occurrences = } 10) = P(S_{10} \leq 1027 \; \;, \;\; S_{11} > 1027 ) \\
= \sum_{j=1}^{1027} f_{10}(j) P(T_2 > 1027-j) = 0.0026318995 [/tex]
This is the probability of observing exactly 10 occurrences of AGGA in the possible ending positions 4,5,…,1027. It is quite a bit less than the value 0.00592 that is obtained from the Poisson approximation.
 
Last edited:
  • #13
How did you get that number?
We end a sequence with A, that increases the probability that the next sequence starts directly (and gives a deviation larger than the 2% I estimated earlier - which then is even more increased by the fact that we look for ~3sigma outliers). In the extreme case, something like "AAAA" would give a much higher rate because the next sequence has 1/4 probability to start right after the previous one.

I made a 50-state markov chain in excel and I get 0.009123921 for at least 10 occurences.

Other probabilities:
0: 0.017
1: 0.070
2: 0.142
3: 0.192
4: 0.195
5: 0.158
6: 0.107
7: 0.062
8: 0.031
9: 0.014
 
Last edited:
  • #14
mfb said:
How did you get that number?
We end a sequence with A, that increases the probability that the next sequence starts directly (and gives a deviation larger than the 2% I estimated earlier - which then is even more increased by the fact that we look for ~3sigma outliers). In the extreme case, something like "AAAA" would give a much higher rate because the next sequence has 1/4 probability to start right after the previous one.

I made a 50-state markov chain in excel and I get 0.009123921 for at least 10 occurences.

Other probabilities:
0: 0.017
1: 0.070
2: 0.142
3: 0.192
4: 0.195
5: 0.158
6: 0.107
7: 0.062
8: 0.031
9: 0.014

I viewed occurrences of AGGA as a delayed renewal process. The time (number of spaces) until the first occurrence is ##T_0##, which has pmf ##f_0 = (3/4) f_1 + (1/4) f_2##, while the times between successive later occurrences has pmf ##f_2##. Here, ##f_1, f_2 ## are the first-passage pmfs from states 1 and 2 to state 5 in the following 5-state Markov chain.

The Markov chain.
State 1 = start, or start over; state 2 = A (first A after starting), state 3 = AG, state 4 = AGG, and state 5 = AGGA (stop).
The 1-step transition probability matrix is
[tex] P = \pmatrix{3/4 & 1/4 & 0 & 0 & 0\\ 1/2 & 1/4 & 1/4 & 0 & 0 \\ 1/2 & 1/4 & 0 & 1/4 & 0 \\ 3/4 & 0 & 0 & 0 & 1/4 \\ 0 & 0 & 0 & 0 & 1}[/tex]
Let ##f_i(n)## denote the probability that ##n## is the first time to reach state 5 from state ##i##. We have ##f_i(1) = p_{i,5}, i = 1,2,3,4## and
[tex] f_i(n+1) = \sum_{j < 5} p_{ij} f_j(n), \; i=1, \ldots, 4 [/tex]
This can be implemented numerically by letting ##f(n) = ## column vector ##(f_i(n), i=1,\ldots, 4)##, and by letting ##Q=## the ##4 \times 4## submatrix of ##P## consisting of rows and columns 1--4. We then have
[tex] f(n+1) = Q f(n), n = 1,2, \ldots[/tex]
with ## f(1) = (0,0,0,1/4)^T ##. The pmf of ##T_1## is ##\{ f_1(n), n = 1,2, \ldots \}##, while the pmf of ##T_2## is ##\{ f_2(n), n = 1,2, \ldots \}##. We just compute these up to ##n = 1230##.

To find the probability that 10 occurrences take place in the interval ##t = 1, 2, \ldots, 1027## we start by letting
[tex] S_{10} = T_0 + \sum_{j=1}^{9} T'_j, [/tex]
which is the number of steps needed to reach 10 occurrences of the pattern. Here, ##T_0, T'_1, T'_2, \ldots ## are independent, ##T_0## has df ##f_0## and the ##T'_k## have df ##f_2##. We take ##f_0 = (3/4) f_1 + (1/4) f_2## because if the first letter is not 'A' we start the Markov chain from state 1, while if the first letter is 'A' we start from state 2. The event ##\{ S_{10} \leq 1027 \} ## occurs when the 10th pattern is on or before time 1027 (so the number occurring on or before 1027 is ##\geq 11##). The event ##\{ \text{No. of occurrences } = 10 \} = \{ S_{10} \leq 1027 \} \cap \{ S_{11} > 1027\}##, so has probability
[tex] \sum_{j=1}^{1027} P(S_{10} = j) P(T_2 > 1027 - j) = \sum_{j=1}^{1027} f_{10}(j) \left(1-\sum_{k=1}^{1027-j} f_2(k) \right) [/tex]

I just did all the computations in Maple.
 
Last edited:
  • #15
I don't see how you get the distribution of the sum of 9 Ti without folding them in some way (or making 9 sums).

Split up into "exactly 10" and "more than 10", I get 0.005874776 and 0.003249145. And I'm quite sure that result is correct (with rounding errors smaller than the difference between our results).
 
  • #16
mfb said:
I don't see how you get the distribution of the sum of 9 Ti without folding them in some way (or making 9 sums).

Split up into "exactly 10" and "more than 10", I get 0.005874776 and 0.003249145. And I'm quite sure that result is correct (with rounding errors smaller than the difference between our results).

I obtained the distribution of ##S_{10}## by the multiple convolution ##f_{10} = f_0 * f_2 * f_2 * \cdot * f_2##, where there are 9 ##f_2## factors. I just did the convolutions numerically.
 

1. What is sequence analysis in probability?

Sequence analysis in probability is a statistical method used to analyze and understand sequences of events or observations. It involves determining the probability of a sequence of events occurring based on previous observations or data.

2. What is the purpose of sequence analysis in probability?

The purpose of sequence analysis in probability is to identify patterns and trends in data to make predictions about future events. It can also be used to determine the likelihood of a particular sequence of events happening.

3. What are some common applications of sequence analysis in probability?

Sequence analysis in probability is commonly used in various fields, including genetics, linguistics, finance, and computer science. It can be used to analyze DNA sequences, language patterns, stock market trends, and computer program outputs, among others.

4. What are the main techniques used in sequence analysis in probability?

Some of the main techniques used in sequence analysis in probability include Markov chains, hidden Markov models, and Bayesian networks. These methods rely on mathematical models and algorithms to analyze and predict sequences of events.

5. What are the limitations of sequence analysis in probability?

One limitation of sequence analysis in probability is that it assumes independence between events, which may not always be the case in real-world situations. Additionally, it may be challenging to accurately model complex sequences, and the results may be sensitive to the initial assumptions and data used.

Similar threads

  • Precalculus Mathematics Homework Help
Replies
1
Views
1K
  • Precalculus Mathematics Homework Help
Replies
8
Views
419
  • Precalculus Mathematics Homework Help
Replies
14
Views
2K
  • Precalculus Mathematics Homework Help
Replies
2
Views
857
  • Precalculus Mathematics Homework Help
Replies
11
Views
734
  • Precalculus Mathematics Homework Help
Replies
1
Views
2K
  • Precalculus Mathematics Homework Help
Replies
4
Views
515
  • Precalculus Mathematics Homework Help
Replies
14
Views
1K
  • Biology and Medical
Replies
1
Views
964
  • Precalculus Mathematics Homework Help
Replies
7
Views
2K
Back
Top