Most "statistical" software (including Excel) have a correlation function that returns the Pearson partial correlation coefficient between two vectors of data. Suppose you organize your DNA sequence as a column of numbers on an Excel spreadsheet:
2
0
3
3
0
1
etc.
Let's say you entered the entire DNA sequence in column A, rows 1 through 240 (range A1:A240 in Excel notation).
Then you add another column. On row r of the second column (column B), copy (point to) the value on row r-1 of the first column (column A). Thus the first row on column B will be blank (not zero). The second value on column B will be equal to the first value on column A. The third value on column B will be equal to the second value on column A. And so on. The easiest way to do this is to go to the second row on Column B (cell B2) and write the expression:
=A1
then hit the "Enter" key. Then copy cell B2 and paste it over the cell range B3 through B240.
Then use the CORREL function like this: Go to a new blank cell and write the following expression:
=CORREL(A2:A240,B2:B240)
and hit "enter". This will produce the correlation coefficient with the DNA sequence and its once-lagged value, which is technically known as the first-order autocorrelation coefficient.
A shortcut is to have a single column (column A), then use:
=CORREL(A2:A240,A1:A239)
You can use this setup to calculate any arbitrary order of autocorrelation. For the 10th-order AC, the shortcut is
=CORREL(A11:A240,A1:A230)
There are two downsides to the Excel approach:
1. CORREL function doesn't test for statistical significance,
2. You can only calculate partial correlations (one at a time), rather than an autocorrelation structure (a number of lags jointly, a.k.a. "distributed lags").
Excel's Analysis ToolPak Add-In will let you calculate partial correlations between pairs of an arbitrary number of variables (columns or ranges); but each of the two downsides would still apply. (Note: to use the Analysis ToolPak Add-In, you have to specify [or input] each lag as a separate column -- you cannot use the shortcut described above).
You can use a "serious" statistical software like SAS to compute partial correlation coefficients and test the statistical significance of each (individually). But downside #2 would still apply. The most efficient way to eliminate both downsides is to use multiple regression analysis, which you can do in Excel using the Analysis ToolPak Add-In.
See
#22 above for an explanation of how to estimate the first through the m'th order autocorrelations
jointly using the multiple regression technique. Again, the main advantage of this approach is that it let's you to calculate and test a number of autocorrelations (lags) jointly, rather than individually.
To estimate the first order AC, all you need is to estimate the regression equation Z
s = a + b
1Z
s-1 + u
s. To estimate the first and the second order AC's jointly, you need to estimate Z
s = a + b
1Z
s-1 + b
2Z
s-2 + u
s. Etc. The data structure should be "column A, column B, etc.," where column A holds the original sequence and each subsequent column holds the once-lagged value of the previous column. Thus, column K will hold the 10 times-lagged sequence.
To test each lag individually, look at the t-statistic of the b parameter corresponding to the lag (e.g. t-stat of b
j for the j'th lag.) To test all lags jointly, look at the "regression F" statistic in the ANOVA table in Excel output. If you have a single independent (right-hand side) variable (as the x in y = a + b x), then Regression F = (t-stat of b)
2.
See attached Excel printout (into PDF). A random "DNA" sequence was regressed on distributed lags 1 through 10; none of the coefficients is statistically significant (t-stat > 2 approximately, or P-value < 0.05); neither are they jointly significant ("Significance F" is nowhere near 0.05). (I am using a 5% probability threshold for statistical significance.) Page 1 is the regression output; pages 2-5 are the data (column A is the dependent variable or the "Y variable", columns B-K are the indep. vars. or the "X variables." Rows 1-11 are not included in the regression). https://www.physicsforums.com/attachments/8791