Updating the mean and sd of a set efficiently

  • Thread starter: jaderberg
  • Tags: Mean, sd, Set
SUMMARY

The discussion focuses on efficiently updating the mean and standard deviation (SD) of a one-dimensional dataset when data points change frequently. The participants highlight the importance of using existing values, mean1 and sd1, along with the old and new values to compute updated statistics without complete recalculation. Key formulas provided include μ_{n+1} = (1 - 1/(n+1))μ_n + (1/(n+1))x_{n+1} for the mean and σ_{n+1}^2 = (1 - 1/n)σ_n^2 + (1/(n+1))(x_{n+1} - μ_n)^2 for the variance. The conversation also emphasizes the distinction between population variance and sample variance, particularly regarding bias correction.

PREREQUISITES
  • Understanding of statistical concepts such as mean and standard deviation
  • Familiarity with variance calculations and bias correction
  • Basic knowledge of algebra for deriving statistical formulas
  • Experience with programming or software tools for statistical analysis (e.g., R, Python)
NEXT STEPS
  • Research "Online algorithms for mean and variance updates" to explore efficient methods
  • Learn about "Population variance vs. sample variance" to understand bias correction implications
  • Study "Numerical stability in statistical computations" to mitigate roundoff errors
  • Explore statistical libraries in programming languages (e.g., NumPy in Python) for practical implementations
USEFUL FOR

Data scientists, statisticians, software developers, and anyone involved in real-time data analysis or statistical computing will benefit from this discussion.

jaderberg
I am trying to work out the most efficient way of updating the mean and standard deviation of a 1 dimensional set of data. The data points change frequently and by a small amount each time, but I do not want to do a complete recalculation of the mean and sd after each change, as this is computationally expensive on a big data set!

Instead I am trying to just update the mean and sd, rather than fully recalculate it. I can do that for one change, but I need to be able to batch changes together and update the mean and sd approximately.

E.g.
data: 2,3,3,3,5,6,1,7 with mean1 and sd1
changes to: 2,4,3,3,4,6,1,7 (two changes of 3->4 and 5->4)

How would I use the existing values mean1, sd1, and the old and new values to update the mean and sd of the set?

I can do this for one change (i.e. mean2 = mean1 + (new_val - old_val)/N, and similarly for sd2), but how would I do it for multiple changes?
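A minimal sketch of one way to batch such updates, assuming the population (divide-by-N) definition of the sd; the function and variable names are illustrative, not from the thread. The running sum and sum of squares are reconstructed from mean1 and sd1 and then adjusted by each (old, new) pair:

```python
import math

def batched_update(mean1, sd1, n, changes):
    """Update the mean and population sd after replacing several values.

    `changes` is a list of (old_val, new_val) pairs; n stays the same
    because values are replaced, not added or removed.
    """
    # Reconstruct the running sum and sum of squares from mean1 and sd1.
    total = n * mean1
    total_sq = n * (sd1**2 + mean1**2)
    # Apply every replacement to the two running quantities.
    for old, new in changes:
        total += new - old
        total_sq += new**2 - old**2
    mean2 = total / n
    var2 = total_sq / n - mean2**2
    return mean2, math.sqrt(max(var2, 0.0))  # clamp tiny negative roundoff

# The example above: 2,3,3,3,5,6,1,7 changes to 2,4,3,3,4,6,1,7.
data = [2, 3, 3, 3, 5, 6, 1, 7]
n = len(data)
mean1 = sum(data) / n
sd1 = math.sqrt(sum((x - mean1) ** 2 for x in data) / n)
print(batched_update(mean1, sd1, n, [(3, 4), (5, 4)]))  # mean 3.75, sd ~1.854
```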
 
Why don't you keep the two quantities used in computing the mean and standard deviation with each data set? That would require storing the sum and the sum of the squares. If a data point changes from x to y then subtract x from the sum and add y, etc. You may accumulate roundoff errors this way, but I think that's a danger in any method of partial updating.

You can recreate the sum and the sum of the squares from the mean and standard deviation. The sum is the mean times the number of data points, etc. This adds more operations and more opportunity for roundoff error.
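A minimal sketch of keeping those two quantities as state, again assuming the population (divide-by-N) sd; the class and method names are illustrative:

```python
import math

class RunningStats:
    """Keep the two quantities suggested above: the running sum and the
    running sum of squares, plus the count."""

    def __init__(self, data):
        self.n = len(data)
        self.total = float(sum(data))
        self.total_sq = float(sum(x * x for x in data))

    def replace(self, old, new):
        # A data point changes from old to new: adjust both running sums.
        self.total += new - old
        self.total_sq += new * new - old * old

    def mean(self):
        return self.total / self.n

    def sd(self):
        m = self.mean()
        return math.sqrt(max(self.total_sq / self.n - m * m, 0.0))

stats = RunningStats([2, 3, 3, 3, 5, 6, 1, 7])
stats.replace(3, 4)
stats.replace(5, 4)
print(stats.mean(), stats.sd())  # 3.75, ~1.854
```

Keeping the sums as state avoids recreating them from the mean and sd on every batch, which, as noted above, adds operations and roundoff.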
 
It takes a bit of algebra to derive but the dynamic updates can be written as
\mu_{n+1} = \left(1-\tfrac{1}{n+1}\right)\mu_n + \tfrac{1}{n+1}x_{n+1}
and
\sigma_{n+1}^2 = \left(1-\tfrac{1}{n}\right)\sigma_n^2 + \tfrac{1}{n+1}(x_{n+1}-\mu_n)^2
or equivalently
\sigma_{n+1}^2 = \left(1-\tfrac{1}{n}\right)\sigma_n^2 + \tfrac{n+1}{n^2}(x_{n+1}-\mu_{n+1})^2
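A minimal sketch of these append-type updates in code, taking σ² here to be the bias-corrected sample variance (divide by n-1), which is what the discussion below turns on:

```python
def append_update(mean_n, var_n, n, x_new):
    """Mean and bias-corrected sample variance after appending x_new
    to a set of n points (the two formulas quoted above)."""
    mean_next = (1 - 1 / (n + 1)) * mean_n + x_new / (n + 1)
    var_next = (1 - 1 / n) * var_n + (x_new - mean_n) ** 2 / (n + 1)
    return mean_next, var_next

# 1,2,3,4 has mean 2.5 and sample variance 5/3; appending 10 gives
# mean 4 and sample variance 12.5.
print(append_update(2.5, 5 / 3, 4, 10))
```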
 
This formula is not correct:
\sigma_{n+1}^2 = \left(1-\tfrac{1}{n}\right)\sigma_n^2 + \tfrac{1}{n+1}(x_{n+1}-\mu_n)^2
 
This is the right one:

\sigma_{n+1}^2 = ( n*\sigma_n^2 + (x_{n+1}-\mu_{n+1})(x_{n+1}-\mu_n) )\tfrac{1}{n+1}
 
roihat said:
This formula is not correct:
\sigma_{n+1}^2 = \left(1-\tfrac{1}{n}\right)\sigma_n^2 + \tfrac{1}{n+1}(x_{n+1}-\mu_n)^2

Are you sure? The above includes a bias correction (i.e. the sum of squared deviations is divided by (n-1) instead of n).


roihat said:
This is the right one:

\sigma_{n+1}^2 = ( n*\sigma_n^2 + (x_{n+1}-\mu_{n+1})(x_{n+1}-\mu_n) )\tfrac{1}{n+1}
 
Example:

1 2 3 4
mean:2.5
stddev:1.11803
variance:1.25

After adding a new number, my formula gives the correct answer for the standard deviation.

1 2 3 4 10
mean:4
stddev:3.16228
variance:10
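A quick check of that update in code, using the population (divide-by-n) definition; the function name is illustrative:

```python
def population_var_update(var_n, n, x_new, mean_n, mean_next):
    """The update quoted above for the population (divide-by-n) variance."""
    return (n * var_n + (x_new - mean_next) * (x_new - mean_n)) / (n + 1)

# 1,2,3,4: mean 2.5, population variance 1.25; after appending 10 the
# mean is 4 and the population variance should be 10.
print(population_var_update(1.25, 4, 10, 2.5, 4.0))  # -> 10.0
```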
 
The distinction being made is between "the variance of a sample" and "an estimator of the variance of the population computed from a sample". There can be a bias correction in a formula for an estimator, but I think the definition of "the variance of a sample" requires that it be computed without the bias correction.
 
You need to consider the difference between "population variance" and "sample variance". The difference in expression is not big, and the difference in expression between the population mean and the sample mean is nothing. In either case, the standard deviation is the square root of the variance, however you calculate it.

Simply maintain a count of the number of objects in the set. When an object is added to the set, update the count, the running sum (for the mean), and the running sum of squares (to compute the variance). If, in updating, an object "falls off the edge" (say, you want the running mean and sd of the last N sampled objects), then you must keep a record of the objects being counted, and when one falls off the edge, you must decrement the count and subtract it from the running sum and running sum of squares.

See this:

http://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance

to get the formulae and to understand the difference.
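A minimal sketch of that bookkeeping for a window of the last N samples; the class and method names are illustrative, and the sd uses the population (divide-by-N) form:

```python
from collections import deque
import math

class WindowStats:
    """Running mean and sd of the most recent `size` samples, maintained
    with a buffer, a running sum, and a running sum of squares."""

    def __init__(self, size):
        self.size = size
        self.buf = deque()
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x):
        self.buf.append(x)
        self.total += x
        self.total_sq += x * x
        if len(self.buf) > self.size:  # the oldest sample falls off the edge
            old = self.buf.popleft()
            self.total -= old
            self.total_sq -= old * old

    def mean(self):
        return self.total / len(self.buf)

    def sd(self):
        m = self.mean()
        return math.sqrt(max(self.total_sq / len(self.buf) - m * m, 0.0))
```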
 
  • #10
rbj said:
You need to consider the difference between "population variance" and "sample variance". The difference in expression is not big, and the difference in expression between the population mean and the sample mean is nothing.

I think you are talking about "an estimator for the population variance", not "the population variance". The formula for computing the population variance would use the distribution of the population as its input. The population variance is not a function of the values in a sample.

Simply maintain a count of the number of objects in the set. When an object is added to the set, update the count, the running sum (for the mean), and the running sum of squares (to compute the variance).

I should agree with that because of post #2! The original poster may have lost interest in this thread, but if he is writing a computer program that may do updates hundreds of times, he should be concerned about roundoff errors, and, to me, it seems safer to keep something like the current sum and sum of squares than to do updates by methods that involve more numerical operations. However, for many data sets, the sum of squares will "overflow", so it cannot be computed simply by summing the squares. I don't know of any numerical methods that work for all orders of magnitude of data. The original poster needs to provide more details if he still needs advice.
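One standard alternative (not proposed in the thread itself) is Welford's online algorithm, which tracks the mean and the sum of squared deviations from it instead of the raw sum of squares, and is less prone to the overflow and cancellation problems mentioned above. A minimal sketch:

```python
class WelfordStats:
    """Welford's online algorithm for the running mean and variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def population_variance(self):
        return self.m2 / self.n

    def sample_variance(self):
        return self.m2 / (self.n - 1)
```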


See this:

http://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance

to get the formulae and to understand the difference.

Ugh. In its current state, that article doesn't promote understanding.
 
  • #11
roihat said:
Example:

1 2 3 4
mean:2.5
stddev:1.11803
variance:1.25

After adding a new number, my formula gives the correct answer for the standard deviation.

1 2 3 4 10
mean:4
stddev:3.16228
variance:10

Further to Stephen and rbj's helpful comments, what you've calculated here is the standard deviation and variance of two discrete random variables with known distributions, which is subtly different to the problem originally posted (where we only have sample data from unknown distributions).

Including the bias correction gives different "correct" answers for the standard deviations of your two samples (1.2910 and 3.5355 respectively).

Most modern software packages include the bias correction when they compute the standard deviation of a sample (e.g. STDEV in Excel, sd in R, std in MATLAB/Octave, StandardDeviation in Mathematica).
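In Python, for instance, both conventions are available side by side; NumPy's default is the population form, while ddof=1 and the standard library's stdev give the bias-corrected sample form:

```python
import numpy as np
from statistics import pstdev, stdev

data = [1, 2, 3, 4, 10]

print(np.std(data))          # 3.1622... population sd (NumPy's default, ddof=0)
print(np.std(data, ddof=1))  # 3.5355... bias-corrected sample sd
print(pstdev(data))          # population sd from the standard library
print(stdev(data))           # bias-corrected sample sd
```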
 
  • #12
bpet said:
what you've calculated here is the standard deviation and variance of two discrete random variables with known distributions, which is subtly different to the problem originally posted (where we only have sample data from unknown distributions).

Computing "the variance of a sample" of N things by the formula that uses division by N (instead of N-1) could be interpreted that way, but the correct interpretation of my point of view is that "variance of a sample" has some , shall we say, "arbitrary" definition - at least in a given textbook. The definition may have been chosen so it can be conveniently used in a formula for some estimators, but "variance of a sample" is a statistic in its own right.

Likewise "mean of a sample" and "standard deviation" of a sample can be defined without any statement that tells how they are to be used as estimators or interpreted as parameters of some population.

The original poster did not ask about estimating the mean and standard deviation of an unknown population. The question is about the "mean and standard deviation of a 1 dimensional set of data".

Of course, answering posts is often a matter of mind reading, and it may be correct that jaderberg's intent is to estimate population parameters. It also may be that jaderberg's definition of "standard deviation of a sample" uses division by N-1.

In fact, my old college text "Introduction To The Theory Of Statistics", 3rd Edition, by Mood, Graybill and Boes, defines (p. 229) the "Sample Variance" to be S^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X})^2 and defines the "second moment" of the sample to be \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2. However, I think (without having checked this morning) that there are other texts which define the "variance of a sample" and the "second moment" of a sample to be the same thing.
bpet said:
Most modern software packages include the bias correction when they compute the standard deviation of a sample (e.g. STDEV in Excel, sd in R, std in MATLAB/Octave, StandardDeviation in Mathematica).

The conventions enforced by software have a powerful effect, so I won't be surprised if the "standard" definition of sample variance in textbooks becomes (or already is) the one involving division by N-1.

A relevant thread is "Standard deviation in Excel" by poster ssd: https://www.physicsforums.com/showthread.php?t=371424&highlight=variance+excel
 