Serial correlation coefficient (random data)


Discussion Overview

The discussion revolves around the calculation of the serial correlation coefficient for a byte array derived from random data. Participants explore various methods for computing this coefficient, comparing results from different approaches, and discussing the implications of their findings in the context of randomness and statistical analysis.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant presents results from a random number sequence test, including a serial correlation coefficient of -0.007476, and questions the validity of their own approach to calculating this coefficient.
  • Another participant suggests that the suspicious result may stem from a lack of precision in calculations, indicating that full precision might yield matching results.
  • A participant shares results from a different dataset, showing a serial correlation of 0.009772, and their own approach yielding 0.0033341683347135162.
  • Discussion includes excerpts from the source code of the original program, with participants attempting to decipher the calculation method for the serial correlation coefficient.
  • One participant references Knuth's definition of the serial correlation statistic and provides a Java implementation that aligns with their Pearson's correlation results, though still differing from the original author's results.
  • Another participant comments on the statistical implications of using n versus n-1 in calculations, explaining the reasoning behind the choice of denominator in correlation calculations.
  • A participant questions the correctness of the formula used for V[j], suggesting it should be U[(j+1)%n] instead of U[j+1] % n.
  • One participant shares their experience of encountering similar issues with signed integers and the n versus n-1 debate, noting that the first value in a sequence may not count towards the correlation measurement.

Areas of Agreement / Disagreement

Participants express differing views on the validity of various approaches to calculating the serial correlation coefficient, particularly regarding the use of n versus n-1 in the denominator. There is no consensus on the best method, and multiple competing perspectives remain throughout the discussion.

Contextual Notes

Participants note limitations related to the precision of calculations, the interpretation of signed integers in different programming languages, and the treatment of the first value in sequences when calculating correlations. These factors contribute to the discrepancies observed in results.

Mr Peanut
I have a byte array of length 16384 bytes obtained from random.org/bytes.

There is a random number sequence test program available from http://www.fourmilab.ch/random/ that, when run on my byte array, returns the following:

Code:
Entropy = 7.989469 bits per byte.

Optimum compression would reduce the size
of this 16384 byte file by 0 percent.

Chi square distribution for 16384 samples is 240.41, and randomly
would exceed this value 73.54 percent of the times.

Arithmetic mean value of data bytes is 127.5778 (127.5 = random).
Monte Carlo value for Pi is 3.164835165 (error 0.74 percent).
Serial correlation coefficient is -0.007476 (totally uncorrelated = 0.0).

(The ENT program's author was John Walker, September 1996: http://www.fourmilab.ch)

I am able to reproduce the results using my own implementations for all parameters except the serial correlation result. The author states: "this quantity measures the extent to which each byte in the file depends upon the previous byte." I am able to follow the source code for the computation but I cannot identify what test is being performed.

One alternate approach I tried was to split the array into two overlapping arrays offset by one index (e.g. {1,2,3,4,5,...,n} => {1,2,3,4,...,n-1} and {2,3,4,5,...,n}). I computed the Pearson's correlation for the resulting arrays: 0.009000. The approach, as expected, gives exactly 1.0 when testing a pair of arrays that are perfectly correlated (e.g. y = 2*x).

Is using my approach a valid way to determine the extent to which each byte in the sequence depends upon the previous byte?

(Does anyone know what test John Walker was implementing? The source code is available at http://www.fourmilab.ch/random/random.zip)
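In rough Java form, my split-and-correlate approach looks like this (a minimal sketch; the method and variable names are mine):

```java
public class LagPearson {
    // Pearson correlation between x[0..n-2] and x[1..n-1], i.e. the
    // lag-1 "split into two offset arrays" approach described above.
    static double lag1Pearson(int[] x) {
        int m = x.length - 1;                  // number of overlapping pairs
        double meanA = 0, meanB = 0;
        for (int j = 0; j < m; j++) {
            meanA += x[j];
            meanB += x[j + 1];
        }
        meanA /= m;
        meanB /= m;
        double cov = 0, varA = 0, varB = 0;
        for (int j = 0; j < m; j++) {
            double a = x[j] - meanA, b = x[j + 1] - meanB;
            cov += a * b;
            varA += a * a;
            varB += b * b;
        }
        return cov / Math.sqrt(varA * varB);
    }

    public static void main(String[] args) {
        // Perfectly correlated sanity check: each value is the previous plus 1.
        int[] ramp = {1, 2, 3, 4, 5, 6};
        System.out.println(lag1Pearson(ramp)); // prints 1.0
    }
}
```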
 
Your approach looks correct but your result looks suspicious -- the 3 least significant digits are exactly 000? I suspect that there is some lack of precision in your calculation. That could easily account for the small difference between your result and the author's. I bet that when you get the full precision, your answers will match.
 
Hi,

I truncated the double precision number to 6 places.

Here are another dataset's results:
ENT serial correlation: 0.009772
My approach: 0.0033341683347135162 (truncated: 0.003334)
 
Mr Peanut said:
(Does anyone know what test John Walker was implementing?

I don't know, but perhaps if we provide some excerpts from the source code, someone will offer an opinion.
Code:
static int mp, sccfirst;
static unsigned int monte[MONTEN];
static long inmont, mcount;
static double cexp, incirc, montex, montey, montepi,
       scc, sccun, sccu0, scclast, scct1, scct2, scct3,
       ent, chisq, datasum;
int oc, c, bean;

The loop that appears to process each byte or group of bytes has:

Code:
/* Update calculation of serial correlation coefficient */

sccun = c;
if (sccfirst) {
    sccfirst = FALSE;
    scclast = 0;
    sccu0 = sccun;
} else {
    scct1 = scct1 + scclast * sccun;
}
scct2 = scct2 + sccun;
scct3 = scct3 + (sccun * sccun);
scclast = sccun;
oc <<= 1;

To finish, there is:

Code:
/* Complete calculation of serial correlation coefficient */

scct1 = scct1 + scclast * sccu0;
scct2 = scct2 * scct2;
scc = totalc * scct3 - scct2;
if (scc == 0.0) {
    scc = -100000;
} else {
    scc = (totalc * scct1 - scct2) / scc;
}
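For reference, my reading of the excerpt, re-expressed in Java (a sketch; `totalc` is assumed to be the number of bytes processed, which isn't shown being set in the excerpt):

```java
public class EntScc {
    // One pass accumulating sum(x[j]*x[j+1]) with a circular wrap (the
    // last byte is paired with the first), sum(x[j]), and sum(x[j]^2),
    // then combined the same way the C code combines scct1/scct2/scct3.
    static double serialCorrelation(int[] data) {
        double scct1 = 0, scct2 = 0, scct3 = 0;
        double sccu0 = data[0];
        double scclast = 0;
        boolean first = true;
        for (int c : data) {
            double sccun = c;
            if (first) {
                first = false;             // first byte has no predecessor yet
            } else {
                scct1 += scclast * sccun;
            }
            scct2 += sccun;
            scct3 += sccun * sccun;
            scclast = sccun;
        }
        double totalc = data.length;       // assumed meaning of ENT's totalc
        scct1 += scclast * sccu0;          // wrap-around pair (last, first)
        scct2 = scct2 * scct2;
        double scc = totalc * scct3 - scct2;
        return (scc == 0.0) ? -100000 : (totalc * scct1 - scct2) / scc;
    }
}
```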
 
OK, after a bit of grinding, here's what I have:

Part of the problem had to do with the way Java (my preferred language) interprets bytes. They are signed. With that resolved...

In his discussion of the serial correlation coefficient, John Walker references:
Knuth, Donald E. The Art of Computer Programming, Volume 2 / Seminumerical Algorithms. Reading MA: Addison-Wesley, 1969. ISBN 0-201-89684-2.
[He states: See [Knuth, pp. 64–65] for more details.]

In Knuth, the serial correlation statistic is defined:
Given U[n] and V[n], the correlation coefficient between them is:
numerator = n * sum(U[j] * V[j]) - sum(U[j]) * sum(V[j])
denominator = [(n * sum(U[j] ^ 2) - sum(U[j]) ^ 2) * (n * sum(V[j] ^ 2) - sum(V[j]) ^ 2)] ^ 0.5
C = numerator/denominator

Where:
all sums are taken over the range 0 <= j < n,
U is the vector of independent random values
V[j] = U[j+1] mod n

In my language of choice:
Code:
public static double KnuthSerialCoefficient(int[] idata) {
		double n = (double) idata.length;
		double[] U = new double[(int) n];
		double[] V = new double[(int) n];
		double sU = 0; // sum(U[j])
		double sV = 0; // sum(V[j])
		double vU2 = 0; // sum(U[j]^2)
		double sV2 = 0; // sum(V[j]^2)
		double sUV = 0; // sum(U[j]*V[j])
		for (int j = 0; j < (int) (n - 1); j++) {
			U[j] = (double) idata[j];
			V[j] = idata[j + 1] % n;
			sU += U[j];
			sV += V[j];
			vU2 += U[j] * U[j];
			sV2 += V[j] * V[j];
			sUV += U[j] * V[j];
		}
		double sUsU = sU * sU; // sum(U[j])^2
		double sVsV = sV * sV; // sum(V[j])^2
		double numerator = n * sUV - (sU * sV);
		double denominator = Math.pow((n * vU2 - sUsU) * (n * sV2 - sVsV), 0.5);
		return numerator / denominator;
	}


Using that procedure, I get exactly the result that I obtain when using my Pearson's Correlation procedure.

It still differs from the Walker ENT result slightly. But, if I substitute (n-1) for n in the numerator and denominator expressions, the results agree to 4 decimal places.

I think that I will use the KnuthSerialCoefficient() method without the substitution unless there's a good reason to use "n-1."
 
Thanks for the update. A couple of statistical comments of academic interest:

1) On the issue of n versus n-1. The correlation measures how the variation of one series about its mean correlates with the variation of the other series about its respective mean. If the means are known, the sums of squares can be divided by n to get an average. If the means are not known, they are estimated from the sample, but that always reduces the apparent variation of the individual values about the estimated mean. The correct compensation is to divide by n-1 instead. It also decreases the degrees of freedom of the associated statistic. With such a tiny correlation, you are well within the range that does not indicate non-random behavior.

2) You can never completely rule out non-random behavior. You can only say that it passes a set of reasonable tests. There are pseudo-random number generators that will pass very sophisticated tests even though they are not random at all.
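Point (1) can be seen in miniature with the two usual variance estimators (an illustrative sketch, not from the thread): dividing the sum of squared deviations from the sample mean by n understates the variation; dividing by n-1 compensates.

```java
public class VarianceDemo {
    // Sum of squared deviations from the *sample* mean, divided either by n
    // (biased: the estimated mean understates the true variation) or by n-1
    // (the unbiased compensation described above).
    static double variance(double[] x, int denominator) {
        double mean = 0;
        for (double v : x) mean += v;
        mean /= x.length;
        double ss = 0;
        for (double v : x) ss += (v - mean) * (v - mean);
        return ss / denominator;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3};
        System.out.println(variance(x, x.length));     // divide by n
        System.out.println(variance(x, x.length - 1)); // divide by n-1
    }
}
```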
 
Shouldn't it be U[(j+1)%n]? That would mean the last V is set to the first U. U[j+1] % n can be anything.
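To illustrate the difference with a tiny example (illustrative values, not from ENT):

```java
public class CircularIndexDemo {
    public static void main(String[] args) {
        int[] u = {10, 20, 30, 40};
        int n = u.length;
        // Circular *index*: for j = n-1 the index wraps, so the last V
        // becomes the first U.
        System.out.println(u[(3 + 1) % n]); // prints 10
        // Taking the *value* mod n instead reduces the data itself:
        // here u[3] % n is 40 % 4, which has nothing to do with wrapping.
        System.out.println(u[3] % n); // prints 0
    }
}
```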
 
I'm a little amused to find I went through exactly the same steps as Mr Peanut before finding this page. I both ran into the signed integer problem (with Python) and the (n-1) vs (n) issue in interpreting the ent code. I got my vectors (for a hardware implementation) to match ent the same way you did, albeit only for binary data (which doesn't require multipliers).

The other way to look at the n-1 vs. n thing is simply that the first value doesn't count. You are concerned with measuring the effect of one value on the next value. E.g., given binary data, a sequence of, say, 8 bits, 01001010, has 7 pairs of interest, not 8: [01, 10, 00, 01, 10, 01, 10]. The equations can be expressed as a function of the frequency of those (n-1) pairs.

The ent code treats the first value as a special case, not counting it and using n-1, since it indeed counted n-1 pairs for an n-bit bitstring.
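A quick sketch of the pair counting (illustrative code, not from ent):

```java
import java.util.ArrayList;
import java.util.List;

public class PairCount {
    // An n-bit string has n-1 adjacent pairs: the first value starts a
    // pair but never completes one, which is why the counts use n-1.
    static List<String> adjacentPairs(String bits) {
        List<String> pairs = new ArrayList<>();
        for (int j = 0; j + 1 < bits.length(); j++) {
            pairs.add(bits.substring(j, j + 2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // The 8-bit example from above yields 7 pairs, not 8.
        System.out.println(adjacentPairs("01001010"));
        // prints [01, 10, 00, 01, 10, 01, 10]
    }
}
```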
 
