# Calculating autocorrelation function

1. Nov 22, 2012

### vjramana

I am posting this question here because my field of research is biophysics, and I am doing molecular dynamics (MD) simulations on a bilayer system.

Well, what I want is some explanation of how to calculate the auto-correlation function from MD simulation data.
I have 4000 frames out of my MD simulation. I want to calculate the correlation function for a particular physical property, say x, measured in each frame, over the whole time evolution.

When I look in the literature, I see an equation of the form C(t) = &lt;x(0)x(t)&gt;.

1) What does this equation really tell us?
2) How can I proceed to calculate the auto-correlation function for the 4000 x values, one from each frame?

I really appreciate any explanation.
Thanks

2. Nov 22, 2012

### sophiecentaur

In non-maths terms, a cross-correlation operation shows the degree of similarity of two functions (with the same number of samples, of course). It is done in two stages. First, it multiplies one function, sample by sample, by the samples of the other (comparison) function and adds all those products up. Then it shifts one set of samples by one position (treating the samples as a loop) and repeats the process. The cross-correlation function is the set of sums you get by the time you have shifted right round the loop. The more the two functions have in common, the bigger the 'swings' of the correlation function.
With the autocorrelation function, you do this but with the function itself, rather than a comparison function.
An interesting characteristic of 'truly random noise' is that its autocorrelation function is zero for all values of offset except zero (i.e. it's a single spike).
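The shift-multiply-sum procedure described above can be sketched like this in Python (my own illustration, not from the thread, assuming numpy is available; circular shifts implement the 'loop'):

```python
import numpy as np

def circular_xcorr(x, y):
    """Cross-correlation by the shift-multiply-sum procedure described above,
    with the samples arranged in a loop (circular shifts)."""
    n = len(x)
    return np.array([np.dot(x, np.roll(y, shift)) for shift in range(n)])

rng = np.random.default_rng(0)
noise = rng.standard_normal(1000)
noise -= noise.mean()               # remove the DC offset first
c = circular_xcorr(noise, noise)    # autocorrelation: correlate with itself

# For random noise, c[0] dominates and every other lag is comparatively small,
# which is the 'single spike' behaviour mentioned above.
```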

It's a function that's available in most maths packages and libraries if you don't feel like writing it yourself.

3. Nov 22, 2012

### vjramana

Well, can I find the code for this auto-correlation in Numerical Recipes? Could you suggest any site giving examples?

4. Nov 23, 2012

### K^2

Are your x values for each frame ordered in some way, or random? It will make a difference in how you need to compute the expectation value.

5. Nov 23, 2012

### vjramana

Dear K^2,

I calculate the angle between each lipid and the bilayer normal in each frame. So I think the data are not random.

6. Nov 23, 2012

### K^2

Ah, I misread something in the original post. Typically, the way you'd estimate autocorrelation from a series of measurements labeled xi = x(ti), for i running from 1 to N, is like this.

$$C(t_j) = \sum_{i=1}^{N-j}\frac{(x_i-\mu)(x_{i+j}-\mu)}{(N-j)\sigma^2}$$

Naturally, this assumes that the tj are evenly spaced. Keep in mind that if you use the same data to compute μ and σ, the result is biased, and the bias gets significantly worse for long autocorrelation times. So basically, you want to make sure that tN is much greater than the maximum tj for which you want to know C(tj).
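As a sketch of how that estimator might be coded (my own illustration in Python with numpy; the AR(1) test series is made up, and μ and σ² are taken from the same data, so the bias mentioned above applies):

```python
import numpy as np

def autocorr(x, max_lag):
    """Normalized autocorrelation estimate, following
    C(t_j) = sum_{i=1..N-j} (x_i - mu)(x_{i+j} - mu) / ((N - j) sigma^2),
    with mu and sigma^2 estimated once from the whole series (biased)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    var = x.var()                   # sigma^2 from the same data -> bias
    return np.array([np.dot(d[: n - j], d[j:]) / ((n - j) * var)
                     for j in range(max_lag + 1)])

# Example: an AR(1)-like series, whose autocorrelation decays roughly
# exponentially; with coefficient 0.9, C at lag 1 should be near 0.9.
rng = np.random.default_rng(1)
x = np.zeros(4000)
for i in range(1, 4000):
    x[i] = 0.9 * x[i - 1] + rng.standard_normal()
c = autocorr(x, 50)
# c[0] is exactly 1 by construction of the normalization
```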

Hopefully, this is what you were looking for.

7. Nov 23, 2012

### sophiecentaur

Last time I did this, I wrote my own code in Fortran - so you can see I'm out of date. I should imagine Mathematica would do it for you.
You could post in the Maths part of PF for the best chance of someone knowing where to get the software. Good luck.
It's easy to do yourself in Basic, though: just a few FOR (??) loops and some indices. It's not too hard to use Visual Basic in Excel.

8. Nov 26, 2012

### vjramana

Dear K^2,

I have a few questions relating to the equation for $C(t_j)$, to correct my understanding.
1) For every new value of $t_j$ (the correlation time), which increases stepwise with each loop, does the calculation need to run from frame 1 up to frame N (in my case a total of 4000 frames)? Does this mean that as $t_j$ increases, the number of samples in the sum decreases?

2) Is there any limiting value for $t_j$?

3) Are $\mu$ and $\sigma$ calculated only once, for all $t_j$?

4) Is this equation normalized?

Last edited: Nov 26, 2012
9. Nov 26, 2012

### K^2

Yes, this is the normalized form. The autocorrelation computed this way should vary mostly between -1 and 1; that's what the σ² in the denominator takes care of.

If you have a better way to estimate μ and σ, use it. If not, then yes, you just compute them once for all tj. That will introduce a bias into your computations, but there's not much you can do about that.

Your tj values are limited by the length of time over which you took measurements. Basically, tN is the duration of the experiment, and you cannot extend the computations past that point. And yes, as you get closer to that last frame, you have fewer and fewer samples in the sum, so the precision will get worse.

It is possible to estimate the uncertainty of C(tj) for each tj, which might tell you how far you can get before the values become unreliable.

10. Nov 26, 2012

### PhilDSP

If you have an FFT package available, the simplest and most efficient way to generate an autocorrelation is:

1. Take the FFT of your data
2. Multiply the FFT by the complex conjugate of the FFT
3. Take the inverse FFT of step #2
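As a minimal sketch of those three steps with numpy (my own example, not from the post; the periodic test signal is made up):

```python
import numpy as np

# FFT-based autocorrelation, following the three steps above.
# Note: this is the circular (signal-processing) version; it implicitly
# treats the data as wrapping around and does not subtract the mean.
x = np.sin(2 * np.pi * np.arange(256) / 32)   # sample signal, period 32

X = np.fft.fft(x)                # 1. take the FFT of the data
S = X * np.conj(X)               # 2. multiply the FFT by its complex conjugate
acf = np.fft.ifft(S).real        # 3. inverse FFT of step 2

acf /= acf[0]                    # optional: normalize so acf[0] == 1
# For this periodic signal, acf is 1 again at lag 32 (one full period)
# and -1 at lag 16 (half a period).
```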

11. Nov 26, 2012

### K^2

This gives the signal-processing version of autocorrelation. It does not take into account the mean and standard deviation. In addition, there is a hidden assumption of periodicity in this approach that might not be valid for this particular application.

12. Nov 26, 2012

### the_emi_guy

FYI,
Both correlation and FFT are freely available in Microsoft Excel (install the Analysis ToolPak).
I have used the FFT, I have not used the correlation.

13. Nov 27, 2012

### PhilDSP

Because of the local nature of the data in the windows used with the FFT or DFT, periodicity of the original data is neither assumed nor implied. The values of either the original data or the extension of the Fourier series outside of each data window at each iteration are immaterial - not even evaluated. Therefore the original data need not be periodic.

I'm not sure how considerations of mean and standard deviation enter into the picture. The choice of the size of the data window would certainly affect the results and the FFT approach would probably be exactly numerically equivalent to the original approach only in the case where the data window is the entire data sample.

I do know though, that the results of the autocorrelation (implemented by any means) give separate deviation values for different parts of the signal or data. Broadband noise has its own deviation value shown by the height of the spike at the FFT or autocorrelation center point compared to the average height of all other parts of the autocorrelation signal. That allows you to determine a signal-to-noise ratio. (Now I see sophiecentaur already mentioned that)

Last edited: Nov 27, 2012
14. Nov 27, 2012

### K^2

You need to take another look at the discrete convolution theorem. Suppose I have two discrete data sets fj and gj, of length N each. I am interested in the product of the DFTs of these two functions.

$$F_k G_k = \sum_{m=0}^{N-1}f_m e^{-2i\pi \frac{k m}{N}} \sum_{n=0}^{N-1}g_n e^{-2i\pi \frac{k n}{N}} = \sum_{n=0}^{N-1}\sum_{m=0}^{N-1}f_m g_n e^{-2i\pi \frac{k (m+n)}{N}}$$
The order in which the sum over n is performed is irrelevant, so I can shift that sum, replacing n by (n-m) mod N.
$$\sum_{n=0}^{N-1}\sum_{m=0}^{N-1}f_m g_{(n-m) \bmod N}\,e^{-2i\pi \frac{k (m+n-m)}{N}} = \sum_{n=0}^{N-1}\sum_{m=0}^{N-1}f_m g_{(n-m) \bmod N}\,e^{-2i\pi \frac{k n}{N}}$$

Which is the DFT of the circular convolution f*g defined below.

$$(f*g)_n = \sum_{m=0}^{N-1}f_m g_{(n-m) \bmod N}$$

Note that the convolution theorem in the discrete case only works if you can take that modulus, and that means at least one of the two functions, f or g, has to be periodic. The reason you can use the convolution theorem with autocorrelations comes from the following definition of the autocorrelation.

$$C_n = \frac{1}{N}\sum_{m=0}^{N-1}f_m \bar{f}_{(m-n) \bmod N}$$

So with the substitution $g_j = \bar{f}_{(-j) \bmod N}$, you have that the inverse DFT of the product of the DFT of f and the complex conjugate of the DFT of f is equal to the above definition of the autocorrelation.

Note that both the convolution and the autocorrelation in this derivation have to be cyclic, or the entire thing simply does not work. Furthermore, notice that this autocorrelation is taken relative to a mean of zero and is not normalized.
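The cyclic identity can also be checked numerically; here is a quick sketch in Python with numpy (my own illustration, not part of the post):

```python
import numpy as np

rng = np.random.default_rng(2)
f = rng.standard_normal(64) + 1j * rng.standard_normal(64)
N = len(f)

# Direct cyclic autocorrelation: C_n = (1/N) sum_m f_m conj(f_{(m-n) mod N}).
# np.roll(f, n)[m] == f[(m - n) mod N], which is exactly the index needed.
C_direct = np.array([np.sum(f * np.conj(np.roll(f, n)))
                     for n in range(N)]) / N

# Via the convolution theorem: inverse DFT of F times conj(F).
# np.fft.ifft already carries the 1/N of the inverse transform, so one
# extra 1/N matches the definition above.
F = np.fft.fft(f)
C_fft = np.fft.ifft(F * np.conj(F)) / N

# The two agree to floating-point precision, confirming the cyclic identity.
```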

These are all perfectly good assumptions in signal processing. I assume your background is in DSP based on your name, so it makes perfect sense that you are used to dealing with autocorrelation in this way. The mean is going to be zero to within a DC offset for any signal. Furthermore, even if instead of a periodic signal you are analyzing a pulse, if you take a sufficiently large chunk of time around the pulse, it might as well be periodic. These things don't get in the way of you analyzing a signal.

However, these assumptions are not true for an arbitrary statistical random variable you might want to consider for autocorrelation analysis. Since the OP is looking at orientations of the lipids, the true mean of the data might be relevant, and the data are certainly not cyclic in general. He might be looking at the response to a pulse of some sort, in which case the cyclic approximation is still valid, but we don't know that, and we don't know whether he has a large enough window for it. So the FFT method might produce erroneous results.

15. Nov 28, 2012

### PhilDSP

Very interesting. It's been quite a few years since I last worked in the field and had begun to look into the statistical side of things. A naive thought would be "why not simply normalize the mean of the data set to zero?" for data in general. But I suppose if you do that, then new sets of data will not be compatible unless you re-normalize at every analysis run. And probably more importantly, doing that would assume the process generating the data is ergodic and Gaussian, wouldn't it? Those may be invalid assumptions.

I'd like to find a very high-level overview of how the statistical considerations can be analyzed through the method of moments, higher-order moments, cumulants, correlation matrices, covariance matrices, and regression. The baffling thing there is that multi-dimensional FFTs can be used to generate the higher-order moment information.

Last edited: Nov 28, 2012
16. Nov 28, 2012

### K^2

Data being normally distributed is assumed pretty much either way. It's not a bad assumption, all things considered. You really only need to know that the error in the estimate of the mean falls off as $\sigma / \sqrt{N}$, and you can pretty much guarantee that with enough data points.

Ergodicity is a big assumption here, yes. But it's present in both treatments of the problem. Otherwise, instead of having a common mean and variance, each data point has its own. The autocorrelation, then, is not a vector, but a tensor.

$$C_{ij} = \frac{<(F_i-\mu_i)(F_j-\mu_j)>}{\sigma_i \sigma_j}$$

Here, by Fi I mean the random variable whose realizations give the possible values of fi. Naturally, this is a mess, and probably won't tell the OP anything useful. Once we assume that all the Fi are identically distributed, but not necessarily independent, we can simplify this autocorrelation to a vector, with statistics for the expectation values gathered directly from the data points of a single run.

And like I mentioned earlier, if μ and σ are gathered from the same run, that naturally introduces a bias. But there might be no way around that.

Once all these things are given, your approach is entirely valid. You can simply compute fi-μ, get autocorrelation from that, and normalize by σ² in the end. And if you were really dealing with a cyclic data set or a short pulse, whose duration is much shorter than your window, then you could do this with FFTs. But we don't know that from OP's description. Either way, the computations required here aren't that heavy. He's not losing much by doing this autocorrelation the brute force way. So it's safer just to do that.
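The recipe just described (subtract the mean, autocorrelate, normalize by σ²) can be sketched both ways; with enough zero-padding, the FFT route reproduces the direct, non-cyclic sum exactly. A minimal illustration in Python with numpy (my own sketch, assuming evenly spaced frames; the test series is made up):

```python
import numpy as np

def acf_direct(x, max_lag):
    """Brute-force estimate: subtract mean, sum lagged products,
    normalize each lag by its sample count and by sigma^2."""
    d = x - x.mean()
    n, var = len(x), x.var()
    return np.array([np.dot(d[: n - j], d[j:]) / ((n - j) * var)
                     for j in range(max_lag + 1)])

def acf_fft(x, max_lag):
    """Same estimate via FFT, zero-padded to length 2n so that the
    circular wraparound contributes nothing to lags 0..n-1."""
    d = x - x.mean()
    n = len(x)
    F = np.fft.fft(d, 2 * n)
    raw = np.fft.ifft(F * np.conj(F)).real[: max_lag + 1]
    counts = n - np.arange(max_lag + 1)   # samples entering each lag's sum
    return raw / (counts * x.var())

rng = np.random.default_rng(3)
x = np.cumsum(rng.standard_normal(4000)) * 0.01 + rng.standard_normal(4000)
a, b = acf_direct(x, 100), acf_fft(x, 100)
# a and b agree to floating-point precision
```

Either way the cost is small for 4000 frames, which is why the brute-force version is a safe default here.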

17. Dec 7, 2012

### vjramana

Dear K^2,

When I calculate the autocorrelation function for 4000 frames with a correlation time (tcorr) of 1800 frames, I get negative values after about frame 600, and the graph approaches zero towards tcorr.

In the literature this is described as a 'caging' effect. What is really meant by the 'caging' effect?

Since my simulation is in the crystal state (not liquid crystal, as in most of the published work), can I attribute the negative values to the 'caging' effect?

Regards