Updating the mean and sd of a set efficiently

  • Context: Undergrad
  • Thread starter: jaderberg
  • Tags: mean, sd, set

Discussion Overview

The discussion centers on the efficient updating of the mean and standard deviation of a one-dimensional data set when data points change frequently. Participants explore methods for updating these statistics without complete recalculation, particularly in the context of batch changes to the data.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • The original poster seeks a method to update the mean and standard deviation efficiently after multiple changes to data points, rather than recalculating from scratch.
  • Some participants suggest maintaining the sum and sum of squares for efficient updates, noting potential roundoff errors in this approach.
  • One participant provides algebraic formulas for updating the mean and variance, but others challenge the correctness of these formulas.
  • There is a distinction made between population variance and sample variance, with discussions on whether bias correction should be applied in calculations.
  • Examples are provided to illustrate the calculations of mean, standard deviation, and variance, highlighting differences in results based on the methods used.
  • Participants express varying interpretations of what constitutes the "variance of a sample" and how it relates to population parameters.

Areas of Agreement / Disagreement

Participants do not reach a consensus on the correct formulas for updating variance and standard deviation, with multiple competing views and interpretations of statistical definitions remaining unresolved.

Contextual Notes

Limitations include potential roundoff errors in numerical methods, the need for clarity on definitions of variance and standard deviation, and the implications of using different formulas based on whether the data represents a sample or a population.

jaderberg
I am trying to work out the most efficient way of updating the mean and standard deviation of a 1 dimensional set of data. The data points change frequently and by a small amount each time, but I do not want to do a complete recalculation of the mean and sd after each change, as this is computationally expensive on a big data set!

Instead I am trying to just update the mean and sd, rather than fully recalculate it. I can do that for one change, but I need to be able to batch changes together and update the mean and sd approximately.

E.g.
data: 2,3,3,3,5,6,1,7 with mean1 and sd1
changes to: 2,4,3,3,4,6,1,7 (two changes of 3->4 and 5->4)

How would I use the existing values mean1 and sd1, together with the old and new values, to update the mean and sd of the set?

I can do this for one change (i.e. mean2 = mean1 + (new_val - old_val)/N, and similarly for sd2), but how would I do it for multiple changes?
 
Why don't you keep the two quantities used in computing the mean and standard deviation with each data set? That would require storing the sum and the sum of the squares. If a data point changes from x to y then subtract x from the sum and add y, etc. You may accumulate roundoff errors this way, but I think that's a danger in any method of partial updating.

You can recreate the sum and the sum of the squares from the mean and standard deviation: the sum is the mean times the number of data points, etc. This adds more operations and more opportunities for roundoff error.
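A minimal sketch of this bookkeeping in Python (the function name and the batch interface are illustrative, not from the thread): reconstruct the sum and sum of squares from the stored mean and population sd, apply a batch of replacements, and read the updated statistics back off.

```python
import math

def batch_update(mean, sd, n, changes):
    """Update the mean and population sd after replacing data points.

    changes is a list of (old_value, new_value) pairs.  The running
    sum and sum of squares are reconstructed from mean and sd, as
    suggested above, then adjusted point by point.
    """
    total = mean * n                      # sum of the data
    sumsq = (sd * sd + mean * mean) * n   # sum of the squares
    for old, new in changes:
        total += new - old
        sumsq += new * new - old * old
    new_mean = total / n
    new_var = sumsq / n - new_mean * new_mean
    return new_mean, math.sqrt(max(new_var, 0.0))
```

On the data in the original post, applying the changes 3→4 and 5→4 to 2, 3, 3, 3, 5, 6, 1, 7 this way reproduces the mean and sd of 2, 4, 3, 3, 4, 6, 1, 7 computed from scratch.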
 
It takes a bit of algebra to derive, but the dynamic updates can be written as
$$\mu_{n+1} = \left(1-\tfrac{1}{n+1}\right)\mu_n + \tfrac{1}{n+1}x_{n+1}$$
and
$$\sigma_{n+1}^2 = \left(1-\tfrac{1}{n}\right)\sigma_n^2 + \tfrac{1}{n+1}(x_{n+1}-\mu_n)^2$$
or equivalently
$$\sigma_{n+1}^2 = \left(1-\tfrac{1}{n}\right)\sigma_n^2 + \tfrac{n+1}{n^2}(x_{n+1}-\mu_{n+1})^2$$
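These updates can be checked numerically. A small sketch in Python, assuming the σ² above follows the sample (n-1 divisor) convention, which is what produces the (1 - 1/n) factor:

```python
def add_point(mean, var, n, x):
    """Append x to n points with the given mean and sample variance
    (n-1 divisor), using the recursive updates quoted above."""
    new_mean = (1 - 1 / (n + 1)) * mean + x / (n + 1)
    new_var = (1 - 1 / n) * var + (x - mean) ** 2 / (n + 1)
    return new_mean, new_var
```

Starting from 1, 2, 3, 4 (mean 2.5, sample variance 5/3) and appending 10 gives mean 4 and sample variance 12.5, matching a direct recalculation with the n-1 divisor.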
 
this formula is not correct:
$$\sigma_{n+1}^2 = \left(1-\tfrac{1}{n}\right)\sigma_n^2 + \tfrac{1}{n+1}(x_{n+1}-\mu_n)^2$$
 
this is the right one

$$\sigma_{n+1}^2 = \frac{n\,\sigma_n^2 + (x_{n+1}-\mu_{n+1})(x_{n+1}-\mu_n)}{n+1}$$
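This version can also be checked numerically; the sketch below assumes σ² is the population variance (divide by n):

```python
def add_point_pop(mean, var, n, x):
    """Append x to n points, treating var as the population variance
    (n divisor), using the update quoted above."""
    new_mean = mean + (x - mean) / (n + 1)
    new_var = (n * var + (x - new_mean) * (x - mean)) / (n + 1)
    return new_mean, new_var
```

Starting from 1, 2, 3, 4 (mean 2.5, population variance 1.25) and adding 10, it returns mean 4 and variance 10.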
 
roihat said:
this formula is not correct.
$$\sigma_{n+1}^2 = \left(1-\tfrac{1}{n}\right)\sigma_n^2 + \tfrac{1}{n+1}(x_{n+1}-\mu_n)^2$$

Are you sure? The above includes a bias correction (i.e. divide the sum of squares by (n-1) instead of n).


roihat said:
this is the right one

$$\sigma_{n+1}^2 = \frac{n\,\sigma_n^2 + (x_{n+1}-\mu_{n+1})(x_{n+1}-\mu_n)}{n+1}$$
 
example.

1 2 3 4
mean:2.5
stddev:1.11803
variance:1.25

after adding a new number, my formula gives the correct answer for the standard deviation.

1 2 3 4 10
mean:4
stddev:3.16228
variance:10
 
The distinction being made is between "the variance of a sample" and "an estimator of the variance of the population computed from a sample". There can be a bias correction in a formula for an estimator, but I think the definition of "the variance of a sample" requires that it be computed without the bias correction.
 
you need to consider the difference between "population variance" and "sample variance". the difference between the two expressions is small, and for the mean there is no difference at all. in either case, the standard deviation is the square root of the variance, however you calculate it.

simply maintain a count of the number of objects in the set. when an object is added to the set, update the count, the running sum (for the mean), and the running sum of squares (to compute the variance). if, in updating, an object "falls off the edge" (say, you want the running mean and sd of the last N sampled objects) then you must keep a record of the objects being counted and when one falls off the edge, you must decrement the count and subtract from the running sum and running sum of squares.

see this:

http://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance

to get the formulae and to understand the difference.
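The "falls off the edge" case above can be sketched with a fixed-size window; the class and method names here are illustrative:

```python
from collections import deque
import math

class WindowStats:
    """Running mean and population sd over the last `size` values.

    As described above, the values themselves are kept so that the one
    falling off the edge can be subtracted from the running sums.
    """
    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0.0   # running sum
        self.sumsq = 0.0   # running sum of squares

    def push(self, x):
        self.window.append(x)
        self.total += x
        self.sumsq += x * x
        if len(self.window) > self.size:
            old = self.window.popleft()   # falls off the edge
            self.total -= old
            self.sumsq -= old * old

    def mean(self):
        return self.total / len(self.window)

    def pop_sd(self):
        m = self.mean()
        return math.sqrt(max(self.sumsq / len(self.window) - m * m, 0.0))
```

Pushing 1, 2, 3, 4 into a window of size 4 gives mean 2.5 and population sd ≈ 1.118; pushing 10 then drops the 1 and the statistics follow the window 2, 3, 4, 10.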
 
  • #10
rbj said:
you need to consider the difference between "population variance" and "sample variance". the difference in expression is not big and the difference in expression between population mean and sample mean is nothing.

I think you are talking about "an estimator for the population variance", not "the population variance". The formula for computing the population variance would use the distribution of the population as its input. The population variance is not a function of the values in a sample.

rbj said:
simply maintain a count of the number of objects in the set. when an object is added to the set, update the count, the running sum (for the mean), and the running sum of squares (to compute the variance).

I should agree with that because of post #2! The original poster may have lost interest in this thread, but if he is writing a computer program that may do updates hundreds of times, he should be concerned about roundoff errors, and, to me, it seems safer to keep something like the current sum and sum of squares than to do updates by methods that involve more numerical operations. However, for many data sets, the sum of squares will "overflow", so it cannot be computed simply by summing the squares. I don't know of any numerical method that works for all orders of magnitude of data. The original poster needs to provide more details if he still needs advice.
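One common remedy for the overflow concern is to accumulate deviations from a pivot value rather than raw squares; a sketch, with the choice of the first element as pivot being arbitrary:

```python
def shifted_mean_var(data):
    """Population mean and variance via sums of deviations from a
    pivot value, which keeps the accumulated squares small even for
    large-magnitude data."""
    pivot = data[0]   # arbitrary pivot; any value near the data works
    n = len(data)
    s = sum(x - pivot for x in data)
    ss = sum((x - pivot) ** 2 for x in data)
    mean = pivot + s / n
    var = (ss - s * s / n) / n
    return mean, var
```

For data like 10^9 + 1, 10^9 + 2, 10^9 + 3, the accumulated squares stay tiny (0, 1, 4) while a naive sum of squares would be on the order of 10^18.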


rbj said:
see this:

http://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance

to get the formulae and to understand the difference.

Ugh. In its current state, that article doesn't promote understanding.
 
  • #11
roihat said:
example.

1 2 3 4
mean:2.5
stddev:1.11803
variance:1.25

after adding a new number.
my formule give the correct answer to the standard deviation.

1 2 3 4 10
mean:4
stddev:3.16228
variance:10

Further to Stephen and rbj's helpful comments, what you've calculated here is the standard deviation and variance of two discrete random variables with known distributions, which is subtly different to the problem originally posted (where we only have sample data from unknown distributions).

Including the bias correction gives different "correct" answers for the standard deviations of your two samples (1.2910 and 3.5355 respectively).

Most modern software packages include the bias correction when they compute the standard deviation of a sample (e.g. STDEV in Excel, sd in R, std in MATLAB/Octave, StandardDeviation in Mathematica).
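Python's standard library exposes both conventions, and reproduces the numbers quoted in this thread:

```python
import statistics

sample = [1, 2, 3, 4]
print(statistics.pstdev(sample))   # population sd (divide by n):   1.118...
print(statistics.stdev(sample))    # sample sd (divide by n-1):     1.290...

sample2 = [1, 2, 3, 4, 10]
print(statistics.pstdev(sample2))  # 3.162...
print(statistics.stdev(sample2))   # 3.535...
```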
 
  • #12
bpet said:
what you've calculated here is the standard deviation and variance of two discrete random variables with known distributions, which is subtly different to the problem originally posted (where we only have sample data from unknown distributions).

Computing "the variance of a sample" of N things by the formula that uses division by N (instead of N-1) could be interpreted that way, but my point of view is that "variance of a sample" has some, shall we say, "arbitrary" definition, at least in a given textbook. The definition may have been chosen so it can be conveniently used in a formula for some estimators, but "variance of a sample" is a statistic in its own right.

Likewise "mean of a sample" and "standard deviation" of a sample can be defined without any statement that tells how they are to be used as estimators or interpreted as parameters of some population.

The original poster did not ask about estimating the mean and standard deviation of an unknown population. The question is about the "mean and standard deviation of a 1 dimensional set of data".

Of course, answering posts is often a matter of mind reading, and it may be correct that jaderberg's intent is to estimate population parameters. It also may be that jaderberg's definition of "standard deviation of a sample" uses division by N-1.

In fact, my old college text "Introduction to the Theory of Statistics", 3rd edition, by Mood, Graybill and Boes, defines (p. 229) the "sample variance" to be $$S^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X})^2$$ and defines the "second moment" of the sample to be $$\frac{1}{n}\sum_{i=1}^n(X_i - \bar{X})^2.$$ However, I think (without having checked this morning) that there are other texts which define the "variance of a sample" and the "second moment" of a sample to be the same thing.
bpet said:
Most modern software packages include the bias correction when they compute the standard deviation of a sample (e.g. STDEV in Excel, sd in R, std in MATLAB/Octave, StandardDeviation in Mathematica).

The conventions enforced by software have a powerful effect, so I won't be surprised if the "standard" definition of sample variance in textbooks becomes (or already is) the one involving division by N-1.

A relevant thread is "Standard deviation in Excel" by poster ssd: https://www.physicsforums.com/showthread.php?t=371424&highlight=variance+excel
 
