Empirical tests on DNA sequence

  1. Dec 28, 2006 #1
    I am doing some biology research and need to check a DNA sequence for complexity and repetitiveness, using empirical tests such as the frequency test, the serial test, etc.
    For that I need to represent or convert the sequence into scalar values.

    My question is basically: how do I apply those tests to a DNA sequence?
     
  3. Dec 28, 2006 #2

    EnumaElish

    Science Advisor
    Homework Helper

    See this article.

    A test like the runs test is applicable to binary sequences. But DNA sequences are multiple valued (e.g. GATTACA). As a first approach, you could separate the 4 codes into 2 binary sets, e.g. "G or A" versus "T or C", and apply the runs test. Then regroup into "G or T" versus "A or C" and re-run. Finally, run "G or C" versus "T or A." If all of these groupings are consistent with randomness, then you could "conclude randomness." But this test would be relatively strict, in that you could not conclude randomness unless all of the groupings are consistent with randomness.

    Fortunately there is a more useful test. I can order the letters lexicographically (alphabetically) so that A > C > G > T. Then I can convert a sequence like GATTACA into G < A > T = T < A > C < A and write "0" for "<" and "1" for ">." The repeated T is a tie, so you need a convention for it: dropping ties gives 01010, while counting a tie as ">" gives 011010. Either way the result is a binary sequence, so I can apply the runs test. If there is any other ordering that you can think of as a biologist (which I am not), say G > A > T > C (perhaps on some luminal frequency scale), then by all means use that rule to convert the letters into 0's and 1's.
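
As a rough sketch of the two conversions above (binary groupings and the up/down coding) followed by a Wald-Wolfowitz style runs test with the usual normal approximation: the function names, the `tie_bit` option and the choice to drop ties by default are assumptions made for illustration, not something specified in this thread.

```python
import math

def group_bits(seq, ones="GA"):
    """1 if the letter is in `ones`, else 0 (e.g. 'G or A' versus 'T or C')."""
    return [1 if ch in ones else 0 for ch in seq]

def up_down_bits(seq, order="TGCA", tie_bit=None):
    """0 for '<' and 1 for '>' between neighbours.

    `order` lists the letters from lowest to highest (here A > C > G > T).
    Ties are dropped unless tie_bit is 0 or 1 (an assumed convention).
    """
    rank = {ch: i for i, ch in enumerate(order)}
    bits = []
    for prev, cur in zip(seq, seq[1:]):
        if rank[prev] == rank[cur]:
            if tie_bit is not None:
                bits.append(tie_bit)
        else:
            bits.append(1 if rank[prev] > rank[cur] else 0)
    return bits

def count_runs(bits):
    """Number of maximal blocks of identical symbols."""
    if not bits:
        return 0
    return 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)

def runs_test_z(bits):
    """z statistic of the runs test under the normal approximation."""
    n1 = sum(bits)
    n0 = len(bits) - n1
    n = n0 + n1
    if n0 == 0 or n1 == 0:
        raise ValueError("both symbols must occur")
    mean = 2 * n0 * n1 / n + 1
    var = 2 * n0 * n1 * (2 * n0 * n1 - n) / (n * n * (n - 1))
    if var <= 0:
        raise ValueError("sequence too short for the approximation")
    return (count_runs(bits) - mean) / math.sqrt(var)

if __name__ == "__main__":
    print(group_bits("GATTACA", ones="GA"))
    print(up_down_bits("GATTACA", tie_bit=1))
    print(runs_test_z(group_bits("GATTACAGATTACAGGCCTTAA", ones="GA")))
```

The normal approximation is only sensible for long sequences. Note also that neighbouring up/down bits share a letter, so they are not independent draws; the z value for that coding should be read with caution.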
     
  4. Jan 2, 2007 #3
    Thanks EnumaElish.

    About the frequency test: is this the right approach?

    Count the frequency of every letter (A, C, G, T) in the sequence.
    Then apply a chi-square test with 4 categories and p = 1/4, using the number of observations that fall into each category and n as the sequence length.
     
  5. Jan 2, 2007 #4

    EnumaElish

    Science Advisor
    Homework Helper

    You are right; a chi-square test can be used, with 3 degrees of freedom. The null hypothesis is a uniform distribution with expected relative frequency p(i) = 1/4, i.e. expected count E(i) = n p(i) = n/4, for i = A, C, G, T.
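
A minimal sketch of that frequency test, assuming the uniform null hypothesis above. The function name is made up for illustration, and the constant 7.815 is the tabulated upper 5% point of the chi-square distribution with 3 degrees of freedom.

```python
from collections import Counter

# Upper 5% critical value of chi-square with 3 degrees of freedom (from tables).
CRITICAL_3DF_05 = 7.815

def frequency_chi_square(seq, alphabet="ACGT"):
    """Pearson chi-square statistic against equal expected counts n/4."""
    counts = Counter(seq)
    expected = len(seq) / len(alphabet)
    return sum((counts[ch] - expected) ** 2 / expected for ch in alphabet)

if __name__ == "__main__":
    stat = frequency_chi_square("AAAAAAAACCGT")
    print(stat, "reject uniformity" if stat > CRITICAL_3DF_05 else "no evidence against uniformity")
```

The chi-square approximation is usually considered reliable only when every expected count is at least about 5, so a real sequence should be much longer than this toy input.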
     
  6. Jan 2, 2007 #5
    EnumaElish,

    What other tests (there are many of them) can I apply to DNA? I mean tests that don't need a binary sequence as input.

    And if all the others, or the best of the tests, do work only with bits, can you suggest more ways to represent a sequence as bits?
     
  7. Jan 2, 2007 #6
    I've found this site: http://www.bioinfo.de/isb/2006/06/0024/
    They describe a program that generates random DNA sequences,
    and they mention some tests that they have tried.

    This is a quote from the site:

    As I understand it, they simply convert every letter to its 8-bit ASCII value? Am I right?

    They have more test examples in this PDF: "Since the output of the tests is many pages long, we show the output of only one test. The outputs of the other tests are available at the web page www.introni.it/en/software/ with the software and the help files."

    http://www.introni.it/en/software/RANDNA.pdf
     
  8. Jan 2, 2007 #7

    EnumaElish

    Science Advisor
    Homework Helper

    These all seem to be valid procedures for determining randomness. And yes, my understanding is that they are converting everything into 8-bit ASCII, which makes the data eminently suitable for the "runs test" that I described previously. Generally, see:

    http://en.wikipedia.org/wiki/Algorithmic_information_theory
    http://en.wikipedia.org/wiki/Algorithmically_random_sequence
    http://en.wikipedia.org/wiki/Normal_number

    Two caveats:
    1. The tests you've referenced may not be independent statistically.
    2. The term "test" which they use in the RANDNA description pages may or may not be synonymous with the term "test" in statistics. E.g., take the "pi" test (which uses the property that almost all real numbers are normal numbers [infinite random sequences]): it is not clear which statistical test they are using to test the computed pi against the true value of pi.
     
    Last edited: Jan 2, 2007
  9. Jan 2, 2007 #8
    So if converting each character to 8-bit ASCII is fine, is coding the 4 DNA nucleotides as
    A=00, C=01, G=10, T=11 legitimate as well?
    I am asking whether a 2-bit representation can affect tests like "runs", "serial", ...
     
  10. Jan 2, 2007 #9

    EnumaElish

    Science Advisor
    Homework Helper

    I don't see why 8 bits are necessary. Two bits will work as well. You can even use one bit to test for "up" vs. "down."
     
  11. Jan 3, 2007 #10
    I am afraid that the wrong bit representation will result in a lossy code and affect the randomness.

    E.g. sequence 1: AC = "0001" and sequence 2: AT = "0011" should have the same randomness, but applying a test to them will return different results.
    Am I right?
     
  12. Jan 3, 2007 #11

    EnumaElish

    Science Advisor
    Homework Helper

    My guess is they are using 8 bits to take advantage of existing (pre-programmed) "data mining" (or data analysis) software procedures. Otherwise, why stop at 8? Why not use a 16, 32, or 64, or 256 bit representation? In my humble opinion, the choice of bits is itself a random rule, so it shouldn't matter. But, to be "safe," you may code your data in several alternative ways (e.g. 2-bit vs. 8-bit) then apply the same test. If the DNA segment you are using is sufficiently long, my guess is that the coding will not make a qualitative difference. Actually, somewhere there may be a definition of "necessary and sufficient bit representation," in that representation of 4 categories in 8 bits is a waste of space, whereas 2 bits are necessary and sufficient.
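
To make the "necessary and sufficient" point concrete, here is a small sketch of the 2-bit coding A=00, C=01, G=10, T=11 proposed earlier in the thread, with a decoder to check that the coding is lossless. The function names are illustrative assumptions.

```python
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
DECODE = {bits: letter for letter, bits in CODE.items()}

def encode_2bit(seq):
    """Concatenate the 2-bit code of each nucleotide."""
    return "".join(CODE[ch] for ch in seq)

def decode_2bit(bits):
    """Inverse of encode_2bit; fails loudly on an odd-length bit string."""
    if len(bits) % 2:
        raise ValueError("bit string length must be even")
    return "".join(DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))

if __name__ == "__main__":
    encoded = encode_2bit("GATTACA")
    print(encoded, decode_2bit(encoded))
```

Because every letter has its own fixed-length code word, decoding always recovers the original sequence. That says the representation loses no information; it does not say that a bit-level test statistic is unaffected by the choice of code.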
     
  13. Jan 3, 2007 #12
    My main concern at this point is creating a lossless bit representation.
    Anyway, I'll try the tests using 2 bits.

    Meanwhile, are there more interesting (and good) tests that you can recommend?
     
  14. Jan 3, 2007 #13

    EnumaElish

    Science Advisor
    Homework Helper

    Actually a runs test will produce the same result for both (2 runs in each, assuming the endpoints count as runs -- you should check this). But AC = 0001 (2 runs) and AG = 0010 (3 runs), so I see your point. In 8-bit ASCII, A = "01000001", C = "01000011" and T = "01010100", so AC = "0100000101000011" and AT = "0100000101010100"; and the runs test will return 8 for the first and 11 for the second.
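
A quick way to check run counts like these is to let Python do the encoding. The sketch below counts runs under the 2-bit coding (A=00, C=01, G=10, T=11) and under standard 8-bit ASCII as produced by `ord`; the helper names are assumptions for illustration.

```python
TWO_BIT = {"A": "00", "C": "01", "G": "10", "T": "11"}

def two_bit(seq):
    """2-bit coding of a DNA string."""
    return "".join(TWO_BIT[ch] for ch in seq)

def ascii_bits(seq):
    """8-bit ASCII coding of a string, most significant bit first."""
    return "".join(format(ord(ch), "08b") for ch in seq)

def count_runs(bits):
    """Number of maximal blocks of identical symbols (endpoints included)."""
    if not bits:
        return 0
    return 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)

if __name__ == "__main__":
    for pair in ("AC", "AT", "AG"):
        print(pair, two_bit(pair), count_runs(two_bit(pair)),
              ascii_bits(pair), count_runs(ascii_bits(pair)))
```

With these codings, AC and AT both have 2 runs in 2-bit form while AG has 3; in standard ASCII, AC has 8 runs and AT has 11. So the raw run count does depend on the coding, which is the concern raised above.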
     
    Last edited: Jan 3, 2007
  15. Jan 3, 2007 #14

    EnumaElish

    Science Advisor
    Homework Helper

    You could calculate the regression equation x_t = a + b x_{t-1} + u_t, where a and b are the regression coefficients, u_t is a stochastic disturbance with expected value 0 (E[u_t] = 0), and the subscript t denotes the t'th place (bit) in the sequence (e.g. X = 0000101010111...). Your dependent variable x_t is binary, so technically you should use a limited dependent variable model such as Probit or Logit; although the linear model is always a good starting point. If you cannot reject the null hypothesis b = 0, then you can conclude that x_t is linearly uncorrelated with x_{t-1} -- which is similar to ENT's correlation test.

    Generalizing, you can run the regression x_t = a + b_1 x_{t-1} + ... + b_k x_{t-k} + u_t (ideally using one of the limited-dep. var. techniques), then execute the joint test b_1 = ... = b_k = 0. This is a generalization of ENT's correlation test, in effect.

    You can also use, for example, a multinomial logit regression package to calculate the statistical relation between Z_t coded as, say, {A,C,G,T} = {1,2,3,4} and Z_{t-j} coded similarly (for j = 1, ..., k); so you don't have to convert letters to bits.
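
As a starting point for the simplest case above (the linear lag-1 regression, not Probit, Logit or the multinomial model), the ordinary least squares estimates can be computed directly. This is only a sketch; the function name and return values are assumptions for illustration.

```python
import math

def lag1_regression(x):
    """OLS fit of x_t = a + b * x_{t-1} + u_t.

    Returns (a, b, se_b), where se_b is the usual OLS standard error of b.
    """
    y = x[1:]    # dependent variable x_t
    z = x[:-1]   # regressor x_{t-1}
    n = len(y)
    if n < 3:
        raise ValueError("need at least 4 observations")
    z_bar = sum(z) / n
    y_bar = sum(y) / n
    s_zz = sum((zi - z_bar) ** 2 for zi in z)
    if s_zz == 0:
        raise ValueError("regressor is constant")
    s_zy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y))
    b = s_zy / s_zz
    a = y_bar - b * z_bar
    rss = sum((yi - a - b * zi) ** 2 for zi, yi in zip(z, y))
    se_b = math.sqrt(rss / (n - 2) / s_zz)
    return a, b, se_b

if __name__ == "__main__":
    bits = [0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1]
    a, b, se_b = lag1_regression(bits)
    print(a, b, se_b)
```

Dividing b by se_b gives the usual t ratio for the null hypothesis b = 0. For a binary series this linear fit is only the rough first look described above.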
     
    Last edited: Jan 4, 2007
  16. Jan 3, 2007 #15

    EnumaElish

    Science Advisor
    Homework Helper

    Small technical edit to previous post:
    The index for Z (data coded as letters) is different from the index for x (data coded as bits).

    A further idea is to make use of the well-known normality tests by using a probability theorem which says that if R is a random variable uniformly distributed on [0, 1] and F is a probability distribution function, then the random variable Y = F^{-1}(R) is distributed (has distribution) F. In effect, you can use a DNA sequence as a uniform pseudorandom number generator by applying to it the inverse of the normal distribution function, then test whether the resulting data (the Y's) are indeed normal. Alternative approaches to generating Y are described here.

    Again, since your uniform distribution is discrete, you may have to use a piecewise randomization algorithm that would enable you to expand from the 4 DNA elements to the [0, 1] continuum. For example, you could assign A = 1/8, C = 3/8, G = 5/8, T = 7/8. When you observe an A, you use software (e.g. Excel) to generate a random number (R) between 0 and 2/8. For a C, generate a ran. var. b/w 2/8 and 4/8. For G, b/w 4/8 and 6/8; for T, b/w 6/8 and 1. Then apply the inverse-F transformation, then test the new ran. var. Y for normality. Even better (but more time consuming) is to design a fractal structure (interval tree) whereby you map the 4 possible letters at the first position of a DNA sequence into 4 equal subintervals of [0, 1], then map the 16 possible two-letter blocks into 16 equal sub-subintervals, and so on, in a way that is graphically reminiscent of constructing a Cantor set (except in your case you will not be deleting any of the subintervals, but further dividing each subinterval into smaller and smaller segments). By dividing the [0, 1] interval into consecutively smaller subintervals, you can "fill in" the [0, 1] continuum to an arbitrary density by using progressively expanding blocks of letters in a DNA sequence, then run the inverse-F normality test.
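
The piecewise randomization step can be sketched with the Python standard library, using `statistics.NormalDist().inv_cdf` as the inverse-F transformation. The function name, the seedable generator argument and the clamping constant are assumptions for illustration.

```python
import random
from statistics import NormalDist

# Lower end of each letter's quarter of [0, 1]: A -> [0, 2/8), C -> [2/8, 4/8), ...
LOWER = {"A": 0.0, "C": 0.25, "G": 0.5, "T": 0.75}

def dna_to_normal(seq, rng=None):
    """Map each letter to a jittered uniform value, then apply the inverse normal CDF."""
    rng = rng or random.Random()
    std_normal = NormalDist()  # mean 0, standard deviation 1
    ys = []
    for ch in seq:
        r = LOWER[ch] + 0.25 * rng.random()
        # inv_cdf is undefined at exactly 0 or 1, so keep r strictly inside (0, 1).
        r = min(max(r, 1e-12), 1 - 1e-12)
        ys.append(std_normal.inv_cdf(r))
    return ys

if __name__ == "__main__":
    ys = dna_to_normal("GATTACAGATTACA", random.Random(0))
    print([round(y, 3) for y in ys])
```

The resulting values can then be passed to any normality test. Keep in mind that the jitter comes from an outside random number generator, so part of the apparent randomness in Y is not coming from the DNA itself.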
     
    Last edited: Jan 4, 2007
  17. Jan 3, 2007 #16

    EnumaElish

    Science Advisor
    Homework Helper

    Another idea is to use the "equality of distributions" (e.o.d.) tests (Run [or Runs] test [apparently a.k.a. Wald-Wolfowitz test], Median test, Rank-sum test [a.k.a. Mann-Whitney U test]) (Mood, Graybill, Boes. Intro. to the Theory of Stat., 3rd Ed.) by testing the hypothesis that the distribution of a DNA sequence is equal to a uniform distribution produced by a random number generator. First, you need to generate a random sequence of 0's and 1's (or alternatively A's, C's, G's and T's) by using a standard random number generator. This will be your benchmark sequence. Then you can use any of the e.o.d. tests to test the hypothesis that the distribution of an actual DNA sequence is not different from the benchmark distribution. The advantage of this approach is that as long as you can find a way to generate random letters A, C, G, T, you do not need to convert your DNA into bits (although you may have to rank them, e.g. A=1, C=2, G=3, T=4).
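
One of these comparisons, the rank-sum test, can be sketched as follows with the ranking A=1, C=2, G=3, T=4. Since only four values occur, the sketch uses midranks and the tie-corrected normal approximation; the function names and the seeded benchmark generator are assumptions for illustration.

```python
import math
import random
from collections import Counter

RANK = {"A": 1, "C": 2, "G": 3, "T": 4}

def random_benchmark(n, seed=None):
    """Benchmark sequence of n letters drawn uniformly from A, C, G, T."""
    return "".join(random.Random(seed).choices("ACGT", k=n))

def rank_sum_z(sample, benchmark):
    """Tie-corrected normal approximation to the Mann-Whitney U statistic."""
    x = [RANK[ch] for ch in sample]
    y = [RANK[ch] for ch in benchmark]
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = Counter(x + y)
    # Midrank of each value: average of the positions its tied block occupies.
    midrank, start = {}, 0
    for value in sorted(pooled):
        t = pooled[value]
        midrank[value] = start + (t + 1) / 2
        start += t
    u1 = sum(midrank[v] for v in x) - n1 * (n1 + 1) / 2
    tie_term = sum(t ** 3 - t for t in pooled.values()) / (n * (n - 1))
    var = n1 * n2 / 12 * ((n + 1) - tie_term)
    if var <= 0:
        raise ValueError("all observations are identical")
    return (u1 - n1 * n2 / 2) / math.sqrt(var)

if __name__ == "__main__":
    dna = "GATTACAGATTACAGGCCTTAAGATTACA"
    print(rank_sum_z(dna, random_benchmark(len(dna), seed=42)))
```

This compares letter composition only (a shift in location), so it says nothing about the order of letters. A test on order, such as the runs test, would still be needed for that.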
     
    Last edited: Jan 4, 2007
  18. Jan 5, 2007 #17

    EnumaElish

    Science Advisor
    Homework Helper

  19. Jan 6, 2007 #18
    Thank you.
     
  20. Jan 7, 2007 #19

    EnumaElish

    Science Advisor
    Homework Helper

    Any DNA sequence is composed of four characters, so it can be thought of as a sequence of base-4 digits. Let A=0, C=1, G=2, T=3. Then GATTAC = 203301. In base 10, this is equal to 2289: 203301_b4 = 2289. But you could also express this as {GAT, TAC} = {203, 301} = {35, 49} or as {GA, TT, AC} = {20, 33, 01} = {8, 15, 1}.
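
The base-4 reading above is easy to reproduce with Python's built-in `int(text, 4)`. The helper names below are assumptions for illustration.

```python
DIGIT = {"A": "0", "C": "1", "G": "2", "T": "3"}

def dna_to_int(seq):
    """Read a DNA string as a base-4 number with A=0, C=1, G=2, T=3."""
    return int("".join(DIGIT[ch] for ch in seq), 4)

def block_values(seq, k):
    """Split the sequence into consecutive blocks of k letters and convert each."""
    return [dna_to_int(seq[i:i + k]) for i in range(0, len(seq), k)]

if __name__ == "__main__":
    print(dna_to_int("GATTAC"))       # 2289
    print(block_values("GATTAC", 3))  # [35, 49]
    print(block_values("GATTAC", 2))  # [8, 15, 1]
```

If the sequence length is not a multiple of k, the last block is simply shorter; whether to keep or drop it is a choice left to the analyst.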

    I am now thinking that the way I have advised to construct the "inverse-F" normality test may introduce randomness to a potentially nonrandom sequence, which will bias the test result toward "false randomness."
     
    Last edited: Jan 7, 2007
  21. Jan 7, 2007 #20
    Fine, but this kind of representation (base-4 or base-10) can help me with only specific tests.

    I still don't know which of your great suggestions, and of the other tests I've read about, I should choose for my problem.
     