Empirical tests on DNA sequence

  • Thread starter: yevi
  • Tags: DNA sequence
The discussion focuses on applying empirical tests to analyze DNA sequences for complexity and randomness, specifically using methods like the runs test and chi-square test. Participants suggest converting DNA sequences into binary representations to facilitate these tests, with various coding schemes discussed, including 2-bit and 8-bit representations. The importance of ensuring lossless coding to maintain the integrity of randomness tests is emphasized. Additionally, alternative statistical methods, such as multinomial logit regression and normality tests, are proposed for analyzing DNA sequences without requiring binary conversion. Overall, the conversation highlights multiple approaches and considerations for effectively assessing DNA sequence randomness.
  • #31
hmm Anova seems interesting
 
  • #32
yevi said:
I am not sure that circular ordering like this is suitable and can be used for run
You can use any ordering you like as long as you "trick" the test into treating the endpoints the same way as the interior points.
 
  • #33
yevi said:
hmm Anova seems interesting
Excel Analysis ToolPak Add-In (which, if installed, will show up as an item "Data Analysis" under the "Tools" menu) also has built-in ANOVA packages -- which I have never used because (for me) regression is more intuitive and practical.
 
  • #34
See edits to the tests described in #29
 
  • #35
I understand this; I need a "trick" that is not lossy.
If I first run with:
A > C > G > T > A
and then with:
A > T > C > G > A
on the same sequence and get different results, that means the trick is lossy.
 
  • #36
But if the sequence is random, you shouldn't. So under the null hypothesis (random DNA), the trick isn't "lossy."
 
  • #37
Your last statement is very interesting (to me). I am going to apply runs tests with various cyclic orders and see what I get :)
 
  • #38
What about some "pattern matching" based tests I could use?
Maybe "Poker"?
 
  • #39
Surely, you can apply any of these tests. (ANOVA is a more rigorous version of the frequency test.)
 
  • #40
If you don't mind, I'll return to the runs test and bring a quote from a book:
The Art of Computer Programming, Vol. 2, Donald E. Knuth

Run test. A sequence may also be tested for "runs up" and "runs down."
This means we examine the length of monotone subsequences of the original
sequence, i.e., segments that are increasing or decreasing.
As an example of the precise definition of a run, consider the sequence of ten
numbers "1298536704"; putting a vertical line at the left and right and between
Xj and Xj+1 whenever Xj > Xj+1, we obtain | 1 2 9 | 8 | 5 | 3 6 7 | 0 4 |, which displays the "runs up": there is a run of length 3, followed by two runs of length 1, followed by another run of length 3, followed by a run of length 2.

In this test he doesn't use circular ordering. Why does he use only "runs up"?
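To make the quoted definition concrete, here is a minimal sketch that cuts a sequence wherever Xj > Xj+1 and reports the resulting "runs up". The function name `runs_up` is just an illustrative choice, and equal neighbours are kept in the same run because the quoted rule only cuts on a strict decrease.

```python
def runs_up(seq):
    """Split seq into maximal ascending runs, cutting wherever seq[j] > seq[j+1]."""
    runs = []
    current = []
    for x in seq:
        # A strict decrease ends the current run and starts a new one.
        if current and current[-1] > x:
            runs.append(current)
            current = []
        current.append(x)
    if current:
        runs.append(current)
    return runs


# The ten digits from the quoted example.
digits = [int(ch) for ch in "1298536704"]
runs = runs_up(digits)
lengths = [len(r) for r in runs]
print(runs)     # [[1, 2, 9], [8], [5], [3, 6, 7], [0, 4]]
print(lengths)  # [3, 1, 1, 3, 2]
```

The run lengths 3, 1, 1, 3, 2 match the quoted passage. For DNA letters, repeated letters are frequent, so how ties are handled matters much more than it does for the digits here.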
 
  • #41
I suspect he is not using circularity because two endpoints (0 and 9) out of 10 are relatively few, as opposed to two out of four. And since he is looking at runs up only, that's actually one out of 10.

A runs test can be set up to test the number of run ups, the # of run downs, the # of both ups and downs, the # of runs above/below the mean or the median, the # of runs of a predetermined length, the maximum run length, and I suppose many more. Each of these tests is based on a different way of constructing the random variable "a run" (except the last, where the r.v. is the "run length"). A random variable can be thought of as another name for a probability distribution. You can construct any test if you know the distribution that applies to that test.
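As one example from that list, here is a rough sketch of the runs test above/below the median with the usual normal approximation. With n1 values above and n2 below the median, the expected number of runs is 2·n1·n2/(n1+n2) + 1 and the variance is 2·n1·n2·(2·n1·n2 − n1 − n2) / ((n1+n2)²·(n1+n2−1)). The function name and the choice to drop values equal to the median are my own assumptions, not something fixed by the test.

```python
import math
from statistics import median


def runs_test_z(values):
    """Runs above/below the median: return (runs, expected_runs, z)."""
    m = median(values)
    # True = above the median, False = below; values equal to it are dropped.
    signs = [v > m for v in values if v != m]
    n1 = sum(signs)
    n2 = len(signs) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("need values on both sides of the median")
    # A new run starts at every change of side.
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    n = n1 + n2
    expected = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    z = (runs - expected) / math.sqrt(variance)
    return runs, expected, z


digits = [int(ch) for ch in "1298536704"]
runs, expected, z = runs_test_z(digits)
print(runs, expected, round(z, 3))
```

For the ten digits from the Knuth quote this gives 5 runs against 6 expected, which is nowhere near significant. Ten observations are far too few for the normal approximation to be trusted, so the numbers only illustrate the mechanics.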
 
  • #42
Please disregard #29 above.

Problems:
1. The regression equation will not measure what it is intended for,
2. Although for very long random sequences (e.g. > 1,000) each letter should appear a quarter of the time, for a relatively short random sequence there can be significantly more of one letter than another.
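To put a rough number on point 2: if each position is independently one of the four letters with probability 1/4 (that is the null hypothesis, and an assumption of this sketch), the count of a given letter in n positions is binomial with mean n/4 and standard deviation sqrt(n · 1/4 · 3/4). The helper name `count_spread` is hypothetical.

```python
import math


def count_spread(n, p=0.25):
    """Mean and standard deviation of a letter count in n independent draws."""
    expected = n * p
    sd = math.sqrt(n * p * (1 - p))
    return expected, sd


for n in (40, 400, 4000):
    expected, sd = count_spread(n)
    # The relative spread sd/expected shrinks like 1/sqrt(n).
    print(n, expected, round(sd, 2), round(sd / expected, 3))
```

At n = 400 the expected count is 100 with a standard deviation of about 8.7, so a count anywhere from roughly 83 to 117 is unremarkable for a random sequence.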

At the very least, this needs more thought on my part.
 
  • #43
OK, very well.
 
  • #44
Now I am trying to apply the Frequency within a block test that was suggested by Prof.

For that I am using the chi-square test as I described on the first page:
(Count the frequency of every letter (A, C, G, T) in the sequence.
Then apply a chi-square test with 4 categories, p = 1/4, the number of observations that fall into each category, and n as the sequence length.)
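The procedure in the parentheses can be sketched as follows. The function name and the example string are made up for illustration; the only outside number is 7.815, the upper 5% point of the chi-square distribution with 3 degrees of freedom (4 categories minus 1).

```python
from collections import Counter

# Upper 5% critical value of chi-square with 3 degrees of freedom.
CHI2_CRIT_DF3_5PCT = 7.815


def chi_square_uniform(seq, alphabet="ACGT"):
    """Pearson chi-square statistic against equal letter frequencies."""
    counts = Counter(seq)
    n = sum(counts[a] for a in alphabet)
    expected = n / len(alphabet)  # n * p with p = 1/4
    stat = sum((counts[a] - expected) ** 2 / expected for a in alphabet)
    return stat, {a: counts[a] for a in alphabet}


# Hypothetical 44-letter sequence: A=14, C=10, G=10, T=10.
example = "ACGT" * 10 + "AAAA"
stat, counts = chi_square_uniform(example)
print(counts, round(stat, 4))
print("reject uniform frequencies at 5%:", stat > CHI2_CRIT_DF3_5PCT)
```

Note that this example passes the frequency test even though it is obviously periodic. The frequency test only looks at letter counts, not at their order, which is why it is combined with runs-type tests.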

My question is: what should the block size be (the input length is 350-400)? And what results from each block test should I expect in order to conclude that the sequence is random?

thanks.
 
  • #45
Wikipedia (http://en.wikipedia.org/wiki/Pearson%27s_chi-square_test) said:
The approximation to the chi-square distribution breaks down if expected frequencies are too low. It will normally be acceptable so long as no more than 10% of the events have expected frequencies below 5. Where there is only 1 degree of freedom, the approximation is not reliable if expected frequencies are below 10. In this case, a better approximation can be had by reducing the absolute value of each difference between observed and expected frequencies by 0.5 before squaring; this is called Yates' correction.
See also: http://www.statisticssolutions.com/Chi_square_test.htm
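Applied to the block-size question: with p = 1/4, the expected count per letter in a block is block_size/4, so keeping it at 5 or more means blocks of at least 20 letters. Below is a sketch of a per-block chi-square that enforces that rule of thumb. The function name, the `min_expected` parameter, and the choice to drop a trailing partial block are my assumptions, and the sequence is assumed to contain only A, C, G, T.

```python
def block_chi_squares(seq, block_size, alphabet="ACGT", min_expected=5.0):
    """Chi-square statistic (3 d.f. for 4 letters) for each full block of seq."""
    expected = block_size / len(alphabet)
    if expected < min_expected:
        raise ValueError("block too small: expected count per letter is below 5")
    stats = []
    # Step through full blocks only; a trailing partial block is ignored.
    for start in range(0, len(seq) - block_size + 1, block_size):
        block = seq[start:start + block_size]
        stat = sum((block.count(a) - expected) ** 2 / expected for a in alphabet)
        stats.append(stat)
    return stats


# Hypothetical 200-letter input: a balanced first half, then 100 A's.
seq = "ACGT" * 25 + "A" * 100
stats = block_chi_squares(seq, block_size=50)
print([round(s, 2) for s in stats])  # [0.08, 0.08, 150.0, 150.0]
```

For an input of 350-400 letters, a block size of 50 would give 7 or 8 blocks with an expected count of 12.5 per letter. Each block statistic can be compared with the 3-d.f. critical value, or, if the blocks are independent, the statistics can be summed and referred to a chi-square with 3 × (number of blocks) degrees of freedom.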

You can also use regression analysis (ANOVA) to test:
1. whether the freq. of a letter within any block is equal to the freq. of the same letter in any other block,
2. whether the difference between the freqs. of two letters is significant within each block,
3. whether the difference between freqs. is different across blocks.

See attached Excel printout with 2 blocks. (These are t tests, so the sample size should be at least 20 letters per block.) The models are:

D1 = b2 + d1 block1 + u
D1 = b1 + d2 block2 + u
D1 = b1 block1 + b2 block2 + u (Constant [the intercept] is Zero)

D1-D2 = b2* + d1* block1 + u
D1-D2 = b1* + d2* block2 + u
D1-D2 = b1* block1 + b2* block2 + u (Constant [the intercept] is Zero)

which show d1 = d2 = d1* = d2* = 0 statistically (t-stat too low or p-value too high), so there is no difference between the blocks. (Although, b2* is statistically significant, which implies a statistically significant difference between the freqs. of C and G within block 2.)
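For readers without the Excel printout, here is a rough sketch of the first model, D1 = b2 + d1 block1 + u, done by hand. I am assuming D1 is a 0/1 indicator that a position holds the letter C (inferred from the remarks about the frequency of C) and that block1 is a 0/1 dummy for membership in block 1. The two 20-letter blocks are made-up data, and `ols_dummy` is a hypothetical helper, not an Excel or library function.

```python
import math


def ols_dummy(y, x):
    """Least-squares fit of y = a + d*x + u; return (a, d, t-statistic of d)."""
    n = len(y)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    d = sxy / sxx
    a = ybar - d * xbar
    # Residual variance with n - 2 degrees of freedom.
    rss = sum((yi - a - d * xi) ** 2 for xi, yi in zip(x, y))
    se_d = math.sqrt(rss / (n - 2) / sxx)
    return a, d, d / se_d


# Hypothetical blocks: C makes up 25% of block 1 and 50% of block 2.
block1 = "ACGT" * 5
block2 = "ACCT" * 5
D1 = [1.0 if ch == "C" else 0.0 for ch in block1 + block2]
dummy1 = [1.0] * len(block1) + [0.0] * len(block2)

a, d, t = ols_dummy(D1, dummy1)
print(a, d, round(t, 3))  # a = 0.5 (freq. of C in block 2), d = -0.25
```

The intercept is the frequency of C in the omitted block and the dummy coefficient is the difference between the two blocks, so a small |t| on the dummy says the blocks do not differ. Compare t with the Student t critical value for n − 2 degrees of freedom.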

If you had 5 blocks, you could run:

D1 = b1 + d2 block2 + d3 block3 + d4 block4 + d5 block5 + u

where the estimated b1 coefficient is freq. of C in the 1st block and estimated dj coefficient (j > 1) is the difference between the freq. of C in the first block and the freq. of C in the j'th block.

To test whether b1 = 1/4, the "Y" variable was redefined as D1 - 1/4 as in Attachment 2:

D1-0.25 = b1** block1 + b2** block2 + u (Constant [the intercept] is Zero)

which shows that the freq. of C in either block is not statistically different from 0.25 at the 5% level of significance. (Note: bj** = bj - 0.25 where bj is the expected freq. of C in the j'th block, as in the D1 models above.)
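The same kind of hand computation works for this last model. With one dummy per block and no intercept, each bj** is just the block mean of D1 − 0.25, and the standard error uses the residual variance pooled over all blocks. As before, D1 as a 0/1 indicator for C is my assumption, the blocks are made-up data, and the function name is hypothetical.

```python
import math


def block_freq_t_stats(blocks, letter="C", p0=0.25):
    """For each block return (bj**, t-statistic) from the no-intercept dummy model."""
    # Dependent variable: indicator of `letter` minus the hypothesised frequency.
    ys = [[(1.0 if ch == letter else 0.0) - p0 for ch in b] for b in blocks]
    means = [sum(y) / len(y) for y in ys]
    # Residuals are deviations from each block's own mean.
    rss = sum((v - m) ** 2 for y, m in zip(ys, means) for v in y)
    dof = sum(len(y) for y in ys) - len(ys)
    s2 = rss / dof
    return [(m, m / math.sqrt(s2 / len(y))) for y, m in zip(ys, means)]


# Hypothetical blocks: C frequency is 0.25 in the first and 0.50 in the second.
results = block_freq_t_stats(["ACGT" * 5, "ACCT" * 5])
for coef, t in results:
    print(coef, round(t, 2))
```

A coefficient near zero with a small |t| means the block's frequency of C is not distinguishable from 0.25. In this made-up data the first block gives exactly 0, while the second gives 0.25 with a t-statistic of about 2.3.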
 

  • #46
For frequency within a block test I prefer to use Chi-square (Pearson's), like I did in "standard" frequency test.
Analyzing one or all of these 3 suggestions of yours.
 
  • #47
yevi said:
For frequency within a block test I prefer to use Chi-square (Pearson's), like I did in "standard" frequency test.
I understand. Chi-sq. is nonparametric, which some people take as an advantage. OTOH, the parametric regression/ANOVA approach lets you test many hypotheses simultaneously (jointly), including "difference-in-differences." In those respects the regression/ANOVA approach can be nested to an arbitrary depth.
 
  • #48
So what you are saying is that chi-square is not suitable for my specific needs?
 
  • #49
That is not at all what I am saying. On the contrary, a nonparametric test can be seen as an advantage. Having said that, I am pointing you toward a complementary approach (ANOVA). It is not an either/or situation. You can apply both types of tests.
 
  • #50
Got it. Thanks for clearing it up :)
 
  • #51
Go for any algorithm (local or global) rather than rounding up to a conclusion. That will help your research.
 
