Serial correlation coefficient (random data)

In summary: the author was eventually able to reproduce the results of the serial correlation test from the program available at http://www.fourmilab.ch
  • #1
Mr Peanut
I have a byte array of length 16384 bytes obtained from random.org/bytes.

There is a random number sequence test program available from http://www.fourmilab.ch/random/ that, when run on my byte array, returns the following:

Code:
Entropy = 7.989469 bits per byte.

Optimum compression would reduce the size
of this 16384 byte file by 0 percent.

Chi square distribution for 16384 samples is 240.41, and randomly
would exceed this value 73.54 percent of the times.

Arithmetic mean value of data bytes is 127.5778 (127.5 = random).
Monte Carlo value for Pi is 3.164835165 (error 0.74 percent).
Serial correlation coefficient is -0.007476 (totally uncorrelated = 0.0).

The ENT program's author was John Walker, September 1996 ( http://www.fourmilab.ch)

I am able to reproduce the results using my own implementations for all parameters except the serial correlation result. The author states: "This quantity measures the extent to which each byte in the file depends upon the previous byte." I am able to follow the source code for the computation, but I cannot identify what test is being performed.

One alternate approach I tried was to split the array into two overlapping arrays offset by one index (e.g. {1,2,3,4,5,...,n} => {1,2,3,4,...,n-1} and {2,3,4,5,...,n}). I computed Pearson's correlation for the resulting arrays: 0.009000. The approach, as expected, gives exactly 1.0 when testing a pair of arrays that are perfectly correlated (e.g. y = 2*x).
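That lag-1 Pearson approach can be sketched as follows (a minimal sketch with hypothetical class and method names, treating bytes as unsigned 0-255 values):

```java
// Minimal sketch of the lag-1 Pearson approach: correlate x[0..n-2]
// against x[1..n-1]. Class and method names are hypothetical.
public class Lag1Pearson {

    static double lag1Pearson(int[] data) {
        int n = data.length - 1; // number of (x[j], x[j+1]) pairs
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int j = 0; j < n; j++) {
            double x = data[j] & 0xFF;     // mask in case values came from signed bytes
            double y = data[j + 1] & 0xFF;
            sx += x;
            sy += y;
            sxx += x * x;
            syy += y * y;
            sxy += x * y;
        }
        double num = n * sxy - sx * sy;
        double den = Math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
        return num / den;
    }

    public static void main(String[] args) {
        // A perfectly correlated sequence (each value one more than the last)
        // gives exactly 1.0, matching the sanity check described above.
        System.out.println(lag1Pearson(new int[]{1, 2, 3, 4, 5}));
    }
}
```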

Is using my approach a valid way to determine the extent to which each byte in the sequence depends upon the previous byte?

(Does anyone know what test John Walker was implementing? The source code is available at http://www.fourmilab.ch/random/random.zip)
 
  • #2
Your approach looks correct but your result looks suspicious -- the 3 least significant digits are exactly 000? I suspect that there is some lack of precision in your calculation. That could easily account for the small difference between your result and the author's. I bet that when you get the full precision, your answers will match.
 
  • #3
Hi,

I truncated the double precision number to 6 places.

Here are another dataset's results:
ENT serial correlation: 0.009772
My approach: 0.0033341683347135162 (truncated: 0.003334)
 
  • #4
Mr Peanut said:
(Does anyone know what test John Walker was implementing?

I don't know, but perhaps if we provide some excerpts from the source code, someone will offer an opinion. The declarations include:

Code:
static int mp, sccfirst;
static unsigned int monte[MONTEN];
static long inmont, mcount;
static double cexp, incirc, montex, montey, montepi,
scc, sccun, sccu0, scclast, scct1, scct2, scct3,
ent, chisq, datasum;
int oc, c, bean;

The loop that appears to process each byte or group of bytes has:

Code:
/* Update calculation of serial correlation coefficient */

sccun = c;
if (sccfirst) {
sccfirst = FALSE;
scclast = 0;
sccu0 = sccun;
} else {
scct1 = scct1 + scclast * sccun;
}
scct2 = scct2 + sccun;
scct3 = scct3 + (sccun * sccun);
scclast = sccun;
oc <<= 1;

To finish, there is:

Code:
/* Complete calculation of serial correlation coefficient */

scct1 = scct1 + scclast * sccu0;
scct2 = scct2 * scct2;
scc = totalc * scct3 - scct2;
if (scc == 0.0) {
scc = -100000;
} else {
scc = (totalc * scct1 - scct2) / scc;
}
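Pulling the two excerpts together, here is a hedged Java transcription (my own sketch, not Walker's code verbatim; I'm assuming totalc is the total count of values processed, which the excerpt doesn't show):

```java
// Sketch of the serial correlation computation as I read the excerpts.
// Assumption: totalc is the number of values processed. Note the final
// step pairs the last value with the first, making the sum circular.
public class SccSketch {

    static double scc(int[] data) {
        double scct1 = 0, scct2 = 0, scct3 = 0;
        double sccu0 = 0, scclast = 0;
        boolean first = true;
        for (int c : data) {
            double sccun = c & 0xFF;   // treat bytes as unsigned
            if (first) {
                first = false;
                scclast = 0;
                sccu0 = sccun;         // remember the first value for the wrap-around term
            } else {
                scct1 += scclast * sccun;
            }
            scct2 += sccun;
            scct3 += sccun * sccun;
            scclast = sccun;
        }
        double totalc = data.length;
        scct1 += scclast * sccu0;      // last value pairs with the first
        scct2 *= scct2;
        double denom = totalc * scct3 - scct2;
        if (denom == 0.0) {
            return -100000;            // undefined (constant input), as in the excerpt
        }
        return (totalc * scct1 - scct2) / denom;
    }
}
```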
 
  • #5
OK, after a bit of grinding, here's what I have:

Part of the problem had to do with the way Java (my preferred language) interprets bytes. They are signed. With that resolved...

In his discussion of the serial correlation coefficient, John Walker references:
Knuth, Donald E. The Art of Computer Programming, Volume 2 / Seminumerical Algorithms. Reading MA: Addison-Wesley, 1969. ISBN 0-201-89684-2.
[He states: See [Knuth, pp. 64–65] for more details.]

In Knuth, the serial correlation statistic is defined:
Given U[n] and V[n], the correlation coefficient between them is:
numerator = n * sum(U[j] * V[j]) - sum(U[j]) * sum(V[j])
denominator = [(n * sum(U[j]^2) - sum(U[j])^2) * (n * sum(V[j]^2) - sum(V[j])^2)]^0.5
C = numerator/denominator

Where:
all sums are taken over the range 0 <= j < n,
U is the vector of independent random values
V[j] = U[j+1] mod n

In my language of choice:
Code:
public static double KnuthSerialCoefficient(int[] idata) {
		double n = (double) idata.length;
		double[] U = new double[(int) n];
		double[] V = new double[(int) n];
		double sU = 0; // sum(U[j])
		double sV = 0; // sum(V[j])
		double vU2 = 0; // sum(U[j]^2)
		double sV2 = 0; // sum(V[j]^2)
		double sUV = 0; // sum(U[j]*V[j])
		for (int j = 0; j < (int) (n - 1); j++) {
			U[j] = (double) idata[j];
			V[j] = idata[j + 1] % n;
			sU += U[j];
			sV += V[j];
			vU2 += U[j] * U[j];
			sV2 += V[j] * V[j];
			sUV += U[j] * V[j];
		}
		double sUsU = sU * sU; // sum(U[j])^2
		double sVsV = sV * sV; // sum(V[j])^2
		double numerator = n * sUV - (sU * sV);
		double denominator = Math.pow((n * vU2 - sUsU) * (n * sV2 - sVsV), 0.5);
		return numerator / denominator;
	}


Using that procedure, I get exactly the result that I obtain when using my Pearson's Correlation procedure.

It still differs from the Walker ENT result slightly. But, if I substitute (n-1) for n in the numerator and denominator expressions, the results agree to 4 decimal places.

I think that I will use the KnuthSerialCoefficient() method without the substitution unless there's a good reason to use "n-1."
 
  • #6
Thanks for the update. A couple of statistical comments of academic interest:

1) On the issue of n versus n-1: the correlation measures how the variation of one series about its mean correlates with the variation of the other series about its respective mean. If the means are known, the sums of squares can be divided by n to get an average. If the means are not known, they are estimated from the sample, but that always understates the variation of the individual values about the estimated mean. The correct compensation is to divide by n-1; it also reduces the degrees of freedom of the associated statistic. With a correlation this tiny, you are well within the range where it does not indicate non-random behavior.
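The point in (1) about the estimated mean understating variation can be illustrated numerically (hypothetical helper; for {1, 2, 3} the sum of squared deviations is 2, so dividing by n gives 2/3 while dividing by n-1 gives 1):

```java
// Illustration of n vs. n-1 in a variance estimate (Bessel's correction).
public class VarianceDemo {

    static double variance(double[] x, boolean besselCorrection) {
        double mean = 0;
        for (double v : x) mean += v;
        mean /= x.length;
        double ss = 0;
        for (double v : x) ss += (v - mean) * (v - mean);
        // Using the estimated mean always understates the spread, so the
        // unbiased estimate divides by n-1 instead of n.
        return ss / (besselCorrection ? x.length - 1 : x.length);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3};
        System.out.println(variance(x, false)); // divide by n
        System.out.println(variance(x, true));  // divide by n-1
    }
}
```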

2) You can never completely rule out non-random behavior. You can only say that it passes a set of reasonable tests. There are pseudo-random number generators that will pass very sophisticated tests even though they are not random at all.
 
  • #7
Shouldn't it be U[(j+1)%n], which would mean that the last V is set to the first U? As written, U[j+1] % n can be anything.
 
  • #8
I'm a little amused to find that I went through exactly the same steps as Mr Peanut before finding this page. I ran into both the signed-integer problem (with Python) and the n-1 vs. n question in interpreting the ent code. I got my vectors (for a hardware implementation) to match ent the same way you did, albeit only for binary data (which doesn't require multipliers).

The other way to look at the n-1 vs. n question is simply that the first value doesn't count. You are measuring the effect of one value on the next. E.g., given binary data, for a sequence of, say, 8 bits, 01001010, there are 7 pairs of interest, not 8: [01, 10, 00, 01, 10, 01, 10]. The equations can be expressed as a function of the frequency of those n-1 pairs.

The ent code treats the first value as a special case, not counting it and using n-1 in the formula, since it indeed counts n-1 pairs for an n-bit bitstring.
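That pair-counting view can be made concrete with a small sketch (class and method names are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Count the n-1 adjacent pairs in a bitstring; the first bit only ever
// appears as the left half of a pair, never on its own.
public class PairCount {

    static Map<String, Integer> countPairs(String bits) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (int j = 0; j + 1 < bits.length(); j++) {
            freq.merge(bits.substring(j, j + 2), 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        // The 8-bit example from above yields 7 pairs.
        System.out.println(countPairs("01001010"));
    }
}
```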
 

1. What is the serial correlation coefficient?

The serial correlation coefficient, also known as the autocorrelation coefficient, is a statistical measure that indicates the degree of linear relationship between a variable and a lagged version of itself. It is commonly used to analyze time series data, where the values of a variable are recorded at different points in time.

2. How is the serial correlation coefficient calculated?

The most common method is to compute Pearson's correlation coefficient between the series and its lagged version: take the covariance between the two, then divide by the product of their standard deviations. The resulting value ranges from -1 to 1, with 0 indicating no correlation and values closer to -1 or 1 indicating strong negative or positive correlation, respectively.

3. What does a positive serial correlation coefficient indicate?

A positive serial correlation coefficient indicates a positive linear relationship between the variable and its lagged version. This means that as the values of the variable increase, the values of its lagged version also tend to increase. This can be interpreted as a pattern or trend in the data, where the variable has a tendency to follow a certain direction over time.

4. What does a negative serial correlation coefficient indicate?

A negative serial correlation coefficient indicates a negative linear relationship between the variable and its lagged version. This means that as the values of the variable increase, the values of its lagged version tend to decrease. This can be interpreted as a pattern or trend in the data, where the variable has a tendency to move in the opposite direction over time.

5. Why is it important to check for serial correlation in data?

It is important to check for serial correlation in data because it can affect the accuracy and reliability of statistical analyses. If a positive or negative relationship exists between a variable and its lagged version, it can lead to biased estimates and incorrect conclusions. Identifying and addressing serial correlation can help improve the validity of statistical models and predictions based on the data.
