Serial correlation coefficient (random data)

  1. May 25, 2014 #1
    I have a byte array of length 16384 bytes obtained from random.org/bytes.

    There is a random number sequence test program available from http://www.fourmilab.ch/random/ that, when run on my byte array, returns the following:

    Code (Text):
    Entropy = 7.989469 bits per byte.

    Optimum compression would reduce the size
    of this 16384 byte file by 0 percent.

    Chi square distribution for 16384 samples is 240.41, and randomly
    would exceed this value 73.54 percent of the times.

    Arithmetic mean value of data bytes is 127.5778 (127.5 = random).
    Monte Carlo value for Pi is 3.164835165 (error 0.74 percent).
    Serial correlation coefficient is -0.007476 (totally uncorrelated = 0.0).
**The ENT program's author was John Walker, September 1996 (http://www.fourmilab.ch)

    I am able to reproduce the results using my own implementations for all parameters except the serial correlation result. The author states: this quantity measures the extent to which each byte in the file depends upon the previous byte. I am able to follow the source code for the computation but I cannot identify what test is being performed.

    One alternate approach I tried was to divide the array into two arrays that repeat the original array with an offset of 1 index (e.g. {1,2,3,4,5,...,n} => {1,2,3,4,...,n-1} and {2,3,4,5,...,n}). I computed the Pearson's correlation for the resultant arrays: 0.009000. The approach, as expected, gives exactly 1.0 when testing a pair of arrays that are perfectly correlated (e.g. y = 2*x).
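    For concreteness, a minimal Java sketch of that offset-pair Pearson computation (my own illustration; the name pearsonOffset is made up, and this is not the ENT code):

    Code (Text):
    public static double pearsonOffset(int[] x) {
            int m = x.length - 1; // number of (current, next) pairs
            double meanU = 0, meanV = 0;
            for (int j = 0; j < m; j++) {
                meanU += x[j];
                meanV += x[j + 1];
            }
            meanU /= m;
            meanV /= m;
            double cov = 0, varU = 0, varV = 0;
            for (int j = 0; j < m; j++) {
                double du = x[j] - meanU;
                double dv = x[j + 1] - meanV;
                cov += du * dv;
                varU += du * du;
                varV += dv * dv;
            }
            return cov / Math.sqrt(varU * varV);
        }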

    Is using my approach a valid way to determine the extent to which each byte in the sequence depends upon the previous byte?

    (Does anyone know what test John Walker was implementing? The source code is available at http://www.fourmilab.ch/random/random.zip)
     
  3. May 25, 2014 #2

    FactChecker

    Science Advisor
    Gold Member

    Your approach looks correct but your result looks suspicious -- the 3 least significant digits are exactly 000? I suspect that there is some lack of precision in your calculation. That could easily account for the small difference between your result and the author's. I bet that when you get the full precision, your answers will match.
     
  4. May 26, 2014 #3
    Hi,

    I truncated the double precision number to 6 places.

    Here are another dataset's results:
    ENT serial correlation: 0.009772
    My approach: 0.0033341683347135162 (0.003334 truncated to six places)
     
  5. May 26, 2014 #4

    Stephen Tashi

    Science Advisor

    I don't know, but perhaps if we provide some excerpts from the source code, someone will offer an opinion.

    The loop that appears to process each byte or group of bytes has:

    To finish, there is:

     
  6. May 26, 2014 #5
    OK, after a bit of grinding, here's what I have:

    Part of the problem had to do with the way Java (my preferred language) interprets bytes: they are signed. With that resolved...

    In his discussion of the serial correlation coefficient, John Walker references:
    Knuth, Donald E. The Art of Computer Programming, Volume 2 / Seminumerical Algorithms. Reading MA: Addison-Wesley, 1969. ISBN 0-201-89684-2.
    [He states: See [Knuth, pp. 64–65] for more details.]

    In Knuth, the serial correlation statistic is defined as follows.
    Given U[n] and V[n], the correlation coefficient between them is:
    numerator = n * sum(U[j] * V[j]) - sum(U[j]) * sum(V[j])
    denominator = [(n * sum(U[j] ^ 2) - sum(U[j]) ^ 2) * (n * sum(V[j] ^ 2) - sum(V[j]) ^ 2)] ^ 0.5
    C = numerator / denominator

    Where:
    all sums are taken over the range 0 <= j < n,
    U is the vector of independent random values,
    V[j] = U[j+1] mod n

    In my language of choice:
    Code (Text):
    public static double KnuthSerialCoefficient(int[] idata) {
            double n = (double) idata.length;
            double[] U = new double[(int) n];
            double[] V = new double[(int) n];
            double sU = 0;  // sum(U[j])
            double sV = 0;  // sum(V[j])
            double sU2 = 0; // sum(U[j]^2)
            double sV2 = 0; // sum(V[j]^2)
            double sUV = 0; // sum(U[j]*V[j])
            // Note: the loop sums over j = 0 .. n-2, i.e. over n-1 consecutive pairs.
            for (int j = 0; j < (int) (n - 1); j++) {
                U[j] = (double) idata[j];
                V[j] = idata[j + 1] % n;
                sU += U[j];
                sV += V[j];
                sU2 += U[j] * U[j];
                sV2 += V[j] * V[j];
                sUV += U[j] * V[j];
            }
            double sUsU = sU * sU; // sum(U[j])^2
            double sVsV = sV * sV; // sum(V[j])^2
            double numerator = n * sUV - (sU * sV);
            double denominator = Math.sqrt((n * sU2 - sUsU) * (n * sV2 - sVsV));
            return numerator / denominator;
        }

    Using that procedure, I get exactly the result that I obtain when using my Pearson's Correlation procedure.

    It still differs from the Walker ENT result slightly. But, if I substitute (n-1) for n in the numerator and denominator expressions, the results agree to 4 decimal places.

    I think that I will use the KnuthSerialCoefficient() method without the substitution unless there's a good reason to use "n-1."
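    For reference, the "n-1" substitution amounts to changing only the last few lines of the method above; a sketch using the same variable names:

    Code (Text):
    // Use the number of pairs actually summed (n - 1) instead of n:
    double m = n - 1;
    double numerator = m * sUV - (sU * sV);
    double denominator = Math.sqrt((m * sU2 - sUsU) * (m * sV2 - sVsV));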
     
  7. May 26, 2014 #6

    FactChecker

    Science Advisor
    Gold Member

    Thanks for the update. A couple of statistical comments of academic interest:

    1) On the issue of n versus n-1: the correlation measures how the variation of one series about its mean correlates with the variation of the other series about its respective mean. If the means are known, the sums of squares can be divided by n to get an average. If the means are not known, they are estimated from the sample, but that always understates the variation of the individual values about the estimated mean. The correct compensation is to divide by n-1 instead, which also reduces the degrees of freedom of the associated statistic by one. (A small numerical sketch of this correction follows after these comments.) With a correlation this tiny, you are well within the range where it does not indicate non-random behavior.

    2) You can never completely rule out non-random behavior. You can only say that it passes a set of reasonable tests. There are pseudo-random number generators that will pass very sophisticated tests even though they are not random at all.
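    A minimal numerical sketch of the n versus n-1 correction mentioned in comment 1 (my own illustration in Java, not code from this thread): dividing squared deviations from the sample mean by n systematically understates the variance, while dividing by n-1 compensates.

    Code (Text):
    public static double variance(double[] x, boolean besselCorrected) {
            double mean = 0;
            for (double v : x) {
                mean += v;
            }
            mean /= x.length;
            double ss = 0; // sum of squared deviations from the sample mean
            for (double v : x) {
                ss += (v - mean) * (v - mean);
            }
            // Divide by n-1 (Bessel's correction) when the mean was estimated from the sample.
            return ss / (besselCorrected ? x.length - 1 : x.length);
        }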
     
  8. Mar 7, 2015 #7
    Shouldn't it be U[(j+1) % n]? That would mean the last V is set to the first U. U[j+1] mod n can be anything.
     
  9. Jul 22, 2016 #8
    I'm a little amused to find I went through exactly the same steps as Mr Peanut before finding this page. I both ran into the signed-integer problem (with Python) and the (n-1) vs. (n) thing in interpreting the ent code. I got my vectors (for a hardware implementation) to match ent in the same way you did, albeit only for binary data (which doesn't require multipliers).

    The other way to look at the n-1 vs. n thing is simply that the first value doesn't count. You are concerned with measuring the effect of one value on the next value. E.g., given binary data, for a sequence of, say, 8 bits, 01001010, there are 7 pairs of interest, not 8: [01, 10, 00, 01, 10, 01, 10]. The equations can be expressed as a function of the frequency of those (n-1) pairs.

    The ent code treats the first value as a special case, not counting it and using n-1, since it has indeed counted n-1 pairs for an n-bit bitstring.
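    For reference, a minimal Java sketch of the wrap-around (circular) pairing discussed in the last two posts, where the last value is paired with the first so exactly n pairs are summed and Knuth's formula collapses to a single set of sums. This is my own paraphrase for illustration, not the actual ent C source:

    Code (Text):
    public static double circularSerialCorrelation(int[] data) {
            int n = data.length;
            double sumProd = 0, sum = 0, sumSq = 0;
            for (int j = 0; j < n; j++) {
                double u = data[j] & 0xFF;           // treat bytes as unsigned 0..255
                double v = data[(j + 1) % n] & 0xFF; // the last byte pairs with the first
                sumProd += u * v;
                sum += u;
                sumSq += u * u;
            }
            // V is just U rotated by one, so sum(V) == sum(U) and sum(V^2) == sum(U^2).
            double denominator = n * sumSq - sum * sum;
            return denominator == 0.0 ? 0.0 : (n * sumProd - sum * sum) / denominator;
        }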
     