Empirical tests on DNA sequence

EnumaElish · Jan 12, 2007

But if the sequence is random, you shouldn't. So under the null hypothesis (random DNA), the trick isn't "lossy."

yevi · Jan 12, 2007

Your last statement is very interesting (for me). I am going to apply run tests with various cycling orders and see what I get :)

yevi · Jan 12, 2007

What about some "pattern matching" based tests I can use?
Maybe "Poker"?

EnumaElish · Jan 12, 2007

Surely, you can apply any of these tests. (ANOVA is a more rigorous version of the frequency test.)

yevi · Jan 12, 2007

If you don't mind, I'll return to the run test and bring a quote from a book:
The Art of Computer Programming, Vol. 2, Donald E. Knuth

Run test. A sequence may also be tested for "runs up" and "runs down."
This means we examine the length of monotone subsequences of the original
sequence, i.e., segments that are increasing or decreasing.
As an example of the precise definition of a run, consider the sequence of ten
numbers "1298536704"; putting a vertical line at the left and right and between
X_j and X_{j+1} whenever X_j > X_{j+1}, we obtain |1 2 9|8|5|3 6 7|0 4|, which displays the "runs up": there is a run of length 3, followed by two runs of length 1, followed by another run of length 3, followed by a run of length 2.

In this test he doesn't use circular ordering. Why does he use only "runs up"?
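
The run decomposition in the quoted passage can be sketched in a few lines of Python. This is only an illustration of the definition (break wherever X_j > X_{j+1}); the function name `runs_up` is made up here and is not from the book.

```python
def runs_up(seq):
    """Split seq into maximal "runs up", breaking wherever X[j] > X[j+1]."""
    if not seq:
        return []
    runs = []
    current = [seq[0]]
    for prev, nxt in zip(seq, seq[1:]):
        if prev > nxt:
            # A descent ends the current run and starts a new one.
            runs.append(current)
            current = [nxt]
        else:
            current.append(nxt)
    runs.append(current)
    return runs


digits = [int(ch) for ch in "1298536704"]
runs = runs_up(digits)
run_lengths = [len(r) for r in runs]
print(runs)         # [[1, 2, 9], [8], [5], [3, 6, 7], [0, 4]]
print(run_lengths)  # [3, 1, 1, 3, 2]
```

The lengths match the quote: 3, then two runs of length 1, then 3, then 2.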

EnumaElish · Jan 12, 2007

I suspect he is not using circularity because two endpoints (0 and 9) out of 10 are relatively few; as opposed to two in four. And since he is looking at run ups only, that's actually one out of 10.

A runs test can be set up to test the number of run ups, the # of run downs, the # of both ups and downs, the # of runs above/below the mean or the median, the # of runs of a predetermined length, the maximum run length, and I suppose many more. Each of these tests is based on a different way of constructing the random variable "a run" (except the last, where the r.v. is the "run length"). A random variable can be thought of as another name for a probability distribution. You can construct any test if you know the distribution that applies to that test.
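
As one concrete instance of the list above, here is a rough sketch of the "runs above/below the median" variant, using the usual large-sample normal approximation for the number of runs. The function name and the choice to drop values equal to the median are assumptions made for this illustration, and ten observations are far too few for the approximation to be trusted; the digits are only reused from the quoted example.

```python
import math
import statistics


def median_runs_test(values):
    """Count runs above/below the median and compare with the count
    expected under randomness (normal approximation)."""
    med = statistics.median(values)
    # Assumption: values equal to the median are dropped.
    signs = [v > med for v in values if v != med]
    n_above = sum(signs)
    n_below = len(signs) - n_above
    # Assumes at least one value on each side of the median.
    n_runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    n = n_above + n_below
    expected = 2 * n_above * n_below / n + 1
    variance = (2 * n_above * n_below * (2 * n_above * n_below - n)
                / (n ** 2 * (n - 1)))
    z = (n_runs - expected) / math.sqrt(variance)
    return n_runs, expected, z


digits = [int(ch) for ch in "1298536704"]
n_runs, expected, z = median_runs_test(digits)
print(n_runs, expected, round(z, 2))  # 5 6.0 -0.67
```

A |z| well below 1.96 would not reject randomness at the 5% level.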

EnumaElish · Jan 14, 2007

Please disregard #29 above.

Problems:
1. The regression equation will not measure what it is intended for,
2. Although for very long random sequences (e.g. > 1,000) each letter should appear a quarter of the time, for a relatively short random sequence there can be significantly more of one letter than another.

At the very least, this needs more thought on my part.

yevi · Jan 15, 2007

ok,very well

yevi · Jan 15, 2007

Now I am trying to apply the Frequency within a block test that was suggested by Prof.

For that I am using the chi-square test as I wrote on the first page:
(Count the frequency of every letter (A, C, G, T) in the sequence.
Then apply a chi-square test where the number of categories is 4, p = 1/4, the observations are the counts that fall into each category, and n is the sequence length.)

My question is: what should the block size be (the input is 350-400)? And what results should I expect from each block test to conclude that the sequence is random?

thanks.
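
The procedure described in this post, applied block by block, might look like the sketch below. It is only an illustration: the function names, the 80-letter input and the block size of 40 are made up, and a trailing partial block is simply ignored. The only outside fact used is the 5% critical value of chi-square with 3 degrees of freedom, about 7.815.

```python
from collections import Counter

# Upper 5% point of the chi-square distribution with 3 degrees of freedom.
CRITICAL_5PCT_DF3 = 7.815


def chi_square_stat(block, letters="ACGT"):
    """Pearson chi-square statistic against equal letter probabilities."""
    counts = Counter(block)
    expected = len(block) / len(letters)  # n * p with p = 1/4
    return sum((counts[x] - expected) ** 2 / expected for x in letters)


def block_frequency_test(sequence, block_size):
    """Chi-square statistic for each full block (a partial tail is dropped)."""
    starts = range(0, len(sequence) - block_size + 1, block_size)
    return [chi_square_stat(sequence[i:i + block_size]) for i in starts]


# Hypothetical input: one balanced block followed by an all-A block.
example = "ACGT" * 10 + "A" * 40
stats = block_frequency_test(example, 40)
rejected = [s > CRITICAL_5PCT_DF3 for s in stats]
print(stats)     # [0.0, 120.0]
print(rejected)  # [False, True]
```

A block whose statistic exceeds the critical value is one where the equal-frequency hypothesis is rejected at the 5% level.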

EnumaElish · Jan 15, 2007

From http://en.wikipedia.org/wiki/Pearson%27s_chi-square_test:

The approximation to the chi-square distribution breaks down if expected frequencies are too low. It will normally be acceptable so long as no more than 10% of the events have expected frequencies below 5. Where there is only 1 degree of freedom, the approximation is not reliable if expected frequencies are below 10. In this case, a better approximation can be had by reducing the absolute value of each difference between observed and expected frequencies by 0.5 before squaring; this is called Yates' correction.

See also: http://www.statisticssolutions.com/Chi_square_test.htm
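
The Yates' correction described in the quote can be sketched as follows. The function name and the two-category counts are hypothetical; the clamp at zero is a safeguard added for this sketch and is not part of the quoted description.

```python
def chi_square(observed, expected, yates=False):
    """Pearson chi-square statistic, optionally with Yates' correction."""
    total = 0.0
    for o, e in zip(observed, expected):
        diff = abs(o - e)
        if yates:
            # Reduce |O - E| by 0.5 before squaring; clamping at zero is an
            # extra safeguard chosen here, not from the quoted text.
            diff = max(diff - 0.5, 0.0)
        total += diff ** 2 / e
    return total


# Hypothetical 20-letter block split into two categories (1 degree of freedom).
observed = [12, 8]
expected = [10, 10]
plain = chi_square(observed, expected)
corrected = chi_square(observed, expected, yates=True)
print(round(plain, 3), round(corrected, 3))  # 0.8 0.45
```

For the four-letter test with p = 1/4, an expected count of 5 per letter corresponds to a block of 20 letters, so by the rule of thumb above shorter blocks would be on shaky ground.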

You can also use regression analysis (ANOVA) to test:
1. whether the freq. of a letter within any block is equal to the freq. of the same letter in any other block,
2. whether the difference between the freqs. of two letters is significant within each block,
3. whether the difference between freqs. is different across blocks.

See attached Excel printout with 2 blocks. (These are t tests, so the sample size should be at least 20 letters per block.) The models are:

D1 = b2 + d1 block1 + u
D1 = b1 + d2 block2 + u
D1 = b1 block1 + b2 block2 + u (Constant [the intercept] is Zero)

D1-D2 = b2^* + d1^* block1 + u
D1-D2 = b1^* + d2^* block2 + u
D1-D2 = b1^* block1 + b2^* block2 + u (Constant [the intercept] is Zero)

which show d1 = d2 = d1^* = d2^* = 0 statistically (t-stat too low or p-value too high), so there is no difference between the blocks. (Although, b2^* is statistically significant, which implies a statistically significant difference between the freqs. of C and G within block 2.)

If you had 5 blocks, you could run:

D1 = b1 + d2 block2 + d3 block3 + d4 block4 + d5 block5 + u

where the estimated b1 coefficient is the freq. of C in the 1st block and the estimated dj coefficient (j > 1) is the freq. of C in the j'th block minus the freq. of C in the first block.
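
To make the interpretation of the coefficients concrete, here is a small sketch. It assumes D1 is the 0/1 indicator for the letter C at each position (the thread's definition of D1 is not shown here), and it uses the fact that a regression on block dummies alone reproduces the block means, so no regression library is called. It gives the point estimates only, not the t-statistics from the Excel printout; the function name and the toy sequence are made up.

```python
def block_dummy_coefficients(sequence, block_size, letter="C"):
    """Point estimates for D1 = b1 + d2 block2 + ... + dk blockk + u.

    Assumes D1 is the 0/1 indicator for `letter` and that the sequence
    holds at least one full block. With only block dummies as regressors,
    the OLS estimates equal block means: b1 is the frequency in block 1
    and dj is the frequency in block j minus the frequency in block 1.
    """
    starts = range(0, len(sequence) - block_size + 1, block_size)
    freqs = [sequence[i:i + block_size].count(letter) / block_size
             for i in starts]
    b1 = freqs[0]
    d = [f - b1 for f in freqs[1:]]
    return b1, d


# Toy input: two blocks of 4 letters.
b1, d = block_dummy_coefficients("CCAA" + "CAAA", 4)
print(b1, d)  # 0.5 [-0.25]
```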

To test whether b1 = 1/4, the "Y" variable was redefined as D1 - 1/4 as in Attachment 2:

D1-0.25 = b1^** block1 + b2^** block2 + u (Constant [the intercept] is Zero)

which shows that the freq. of C in either block is not statistically different from 0.25 at the 5% level of significance. (Note: bj^** = bj - 0.25 where bj is the expected freq. of C in the j'th block, as in the D1 models above.)

yevi · Jan 16, 2007

For frequency within a block test I prefer to use Chi-square (Pearson's), like I did in "standard" frequency test.
Analyzing one or all of these 3 suggestions of yours.

EnumaElish · Jan 16, 2007

yevi said:

For frequency within a block test I prefer to use Chi-square (Pearson's), like I did in "standard" frequency test.

I understand. Chi-sq. is nonparametric, which some people take as an advantage. OTOH, the parametric regression/ANOVA approach lets you test many hypotheses simultaneously (jointly), including "difference-in-differences." In those respects the regression/ANOVA approach can be nested to an arbitrary depth.

yevi · Jan 16, 2007

So what you are saying is that chi-square is not suitable for my specific needs?

EnumaElish · Jan 16, 2007

That is not at all what I am saying. On the contrary, a nonparametric test can be seen as an advantage. Having said that, I am pointing you toward a complementary approach (ANOVA). It is not an either/or situation. You can apply both types of tests.

yevi · Jan 17, 2007

Got it. Thanks for clearing it up :)

micro.arnab · Jun 7, 2007

Go for any algorithm (local or global) rather than jumping to a conclusion. That will help your research.
