Updating the mean and sd of a set efficiently

  • Thread starter: jaderberg
  • Tags: Mean, sd, Set
SUMMARY

The discussion focuses on efficiently updating the mean and standard deviation (SD) of a one-dimensional dataset when data points change frequently. The participants highlight the importance of using existing values, mean1 and sd1, along with the old and new values to compute updated statistics without complete recalculation. Key formulas provided include μ_{n+1} = (1 - 1/(n+1))μ_n + (1/(n+1))x_{n+1} for the mean and σ_{n+1}^2 = (1 - 1/n)σ_n^2 + (1/(n+1))(x_{n+1} - μ_n)^2 for the variance. The conversation also emphasizes the distinction between population variance and sample variance, particularly regarding bias correction.

PREREQUISITES
  • Understanding of statistical concepts such as mean and standard deviation
  • Familiarity with variance calculations and bias correction
  • Basic knowledge of algebra for deriving statistical formulas
  • Experience with programming or software tools for statistical analysis (e.g., R, Python)
NEXT STEPS
  • Research "Online algorithms for mean and variance updates" to explore efficient methods
  • Learn about "Population variance vs. sample variance" to understand bias correction implications
  • Study "Numerical stability in statistical computations" to mitigate roundoff errors
  • Explore statistical libraries in programming languages (e.g., NumPy in Python) for practical implementations
USEFUL FOR

Data scientists, statisticians, software developers, and anyone involved in real-time data analysis or statistical computing will benefit from this discussion.

jaderberg
I am trying to work out the most efficient way of updating the mean and standard deviation of a 1 dimensional set of data. The data points change frequently and by a small amount each time, but I do not want to do a complete recalculation of the mean and sd after each change, as this is computationally expensive on a big data set!

Instead I am trying to just update the mean and sd, rather than fully recalculate it. I can do that for one change, but I need to be able to batch changes together and update the mean and sd approximately.

E.g.
data: 2,3,3,3,5,6,1,7 with mean1 and sd1
changes to: 2,4,3,3,4,6,1,7 (two changes of 3->4 and 5->4)

How would I use the existing values mean1, sd1, and the old and new values to update the mean and sd of the set?

I can do this for one change (i.e. mean2 = mean1 + (new_val - old_val)/N, and similarly for sd2), but how would I do it for multiple changes?
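A minimal sketch of one way to batch such updates, assuming the population (divide-by-N) definition of the sd; the function and variable names are illustrative, not from the thread. The running sum and sum of squares are reconstructed from mean1 and sd1 and then adjusted by each (old, new) pair:

```python
import math

def batched_update(mean1, sd1, n, changes):
    """Update the mean and population sd after replacing several values.

    `changes` is a list of (old_val, new_val) pairs; n stays the same
    because values are replaced, not added or removed.
    """
    # Reconstruct the running sum and sum of squares from mean1 and sd1.
    total = n * mean1
    total_sq = n * (sd1**2 + mean1**2)
    # Apply every replacement to the two running quantities.
    for old, new in changes:
        total += new - old
        total_sq += new**2 - old**2
    mean2 = total / n
    var2 = total_sq / n - mean2**2
    return mean2, math.sqrt(max(var2, 0.0))  # clamp tiny negative roundoff

# The example above: 2,3,3,3,5,6,1,7 changes to 2,4,3,3,4,6,1,7.
data = [2, 3, 3, 3, 5, 6, 1, 7]
n = len(data)
mean1 = sum(data) / n
sd1 = math.sqrt(sum((x - mean1) ** 2 for x in data) / n)
print(batched_update(mean1, sd1, n, [(3, 4), (5, 4)]))  # mean 3.75, sd ~1.854
```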
 
Why don't you keep the two quantities used in computing the mean and standard deviation with each data set? That would require storing the sum and the sum of the squares. If a data point changes from x to y then subtract x from the sum and add y, etc. You may accumulate roundoff errors this way, but I think that's a danger in any method of partial updating.

You can recreate the sum and the sum of the squares from the mean and standard deviation. The sum is the mean times the number of data points, etc. This adds more operations and more opportunity for roundoff error.
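A minimal sketch of keeping those two quantities as state, again assuming the population (divide-by-N) sd; the class and method names are illustrative:

```python
import math

class RunningStats:
    """Keep the two quantities suggested above: the running sum and the
    running sum of squares, plus the count."""

    def __init__(self, data):
        self.n = len(data)
        self.total = float(sum(data))
        self.total_sq = float(sum(x * x for x in data))

    def replace(self, old, new):
        # A data point changes from old to new: adjust both running sums.
        self.total += new - old
        self.total_sq += new * new - old * old

    def mean(self):
        return self.total / self.n

    def sd(self):
        m = self.mean()
        return math.sqrt(max(self.total_sq / self.n - m * m, 0.0))

stats = RunningStats([2, 3, 3, 3, 5, 6, 1, 7])
stats.replace(3, 4)
stats.replace(5, 4)
print(stats.mean(), stats.sd())  # 3.75, ~1.854
```

Keeping the sums as state avoids recreating them from the mean and sd on every batch, which, as noted above, adds operations and roundoff.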
 
It takes a bit of algebra to derive but the dynamic updates can be written as
\mu_{n+1} = \left(1-\tfrac{1}{n+1}\right)\mu_n + \tfrac{1}{n+1}x_{n+1}
and
\sigma_{n+1}^2 = \left(1-\tfrac{1}{n}\right)\sigma_n^2 + \tfrac{1}{n+1}(x_{n+1}-\mu_n)^2
or equivalently
\sigma_{n+1}^2 = \left(1-\tfrac{1}{n}\right)\sigma_n^2 + \tfrac{n+1}{n^2}(x_{n+1}-\mu_{n+1})^2
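A minimal sketch of these append-type updates in code, taking σ² here to be the bias-corrected sample variance (divide by n-1), which is what the discussion below turns on:

```python
def append_update(mean_n, var_n, n, x_new):
    """Mean and bias-corrected sample variance after appending x_new
    to a set of n points (the two formulas quoted above)."""
    mean_next = (1 - 1 / (n + 1)) * mean_n + x_new / (n + 1)
    var_next = (1 - 1 / n) * var_n + (x_new - mean_n) ** 2 / (n + 1)
    return mean_next, var_next

# 1,2,3,4 has mean 2.5 and sample variance 5/3; appending 10 gives
# mean 4 and sample variance 12.5.
print(append_update(2.5, 5 / 3, 4, 10))
```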
 
This formula is not correct:
\sigma_{n+1}^2 = \left(1-\tfrac{1}{n}\right)\sigma_n^2 + \tfrac{1}{n+1}(x_{n+1}-\mu_n)^2
 
This is the right one:

\sigma_{n+1}^2 = ( n*\sigma_n^2 + (x_{n+1}-\mu_{n+1})(x_{n+1}-\mu_n) )\tfrac{1}{n+1}
 
roihat said:
This formula is not correct:
\sigma_{n+1}^2 = \left(1-\tfrac{1}{n}\right)\sigma_n^2 + \tfrac{1}{n+1}(x_{n+1}-\mu_n)^2

Are you sure? The above includes a bias correction (i.e. the sum of squared deviations is divided by (n-1) instead of n).


roihat said:
This is the right one:

\sigma_{n+1}^2 = ( n*\sigma_n^2 + (x_{n+1}-\mu_{n+1})(x_{n+1}-\mu_n) )\tfrac{1}{n+1}
 
Example:

1 2 3 4
mean:2.5
stddev:1.11803
variance:1.25

After adding a new number, my formula gives the correct answer for the standard deviation.

1 2 3 4 10
mean:4
stddev:3.16228
variance:10
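A quick check of that update in code, using the population (divide-by-n) definition; the function name is illustrative:

```python
def population_var_update(var_n, n, x_new, mean_n, mean_next):
    """The update quoted above for the population (divide-by-n) variance."""
    return (n * var_n + (x_new - mean_next) * (x_new - mean_n)) / (n + 1)

# 1,2,3,4: mean 2.5, population variance 1.25; after appending 10 the
# mean is 4 and the population variance should be 10.
print(population_var_update(1.25, 4, 10, 2.5, 4.0))  # -> 10.0
```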
 
The distinction being made is between "the variance of a sample" and "an estimator of the variance of the population computed from a sample". There can be a bias correction in a formula for an estimator, but I think the definition of "the variance of a sample" requires that it be computed without the bias correction.
 
You need to consider the difference between "population variance" and "sample variance". The difference in expression is not big, and the difference in expression between the population mean and the sample mean is nothing. In either case, the standard deviation is the square root of the variance, however you calculate it.

Simply maintain a count of the number of objects in the set. When an object is added to the set, update the count, the running sum (for the mean), and the running sum of squares (to compute the variance). If, in updating, an object "falls off the edge" (say, you want the running mean and sd of the last N sampled objects), then you must keep a record of the objects being counted, and when one falls off the edge, you must decrement the count and subtract it from the running sum and running sum of squares.

See this:

http://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance

to get the formulae and to understand the difference.
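A minimal sketch of that bookkeeping for a window of the last N samples; the class and method names are illustrative, and the sd uses the population (divide-by-N) form:

```python
from collections import deque
import math

class WindowStats:
    """Running mean and sd of the most recent `size` samples, maintained
    with a buffer, a running sum, and a running sum of squares."""

    def __init__(self, size):
        self.size = size
        self.buf = deque()
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x):
        self.buf.append(x)
        self.total += x
        self.total_sq += x * x
        if len(self.buf) > self.size:  # the oldest sample falls off the edge
            old = self.buf.popleft()
            self.total -= old
            self.total_sq -= old * old

    def mean(self):
        return self.total / len(self.buf)

    def sd(self):
        m = self.mean()
        return math.sqrt(max(self.total_sq / len(self.buf) - m * m, 0.0))
```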
 
  • #10
rbj said:
You need to consider the difference between "population variance" and "sample variance". The difference in expression is not big, and the difference in expression between the population mean and the sample mean is nothing.

I think you are talking about "an estimator for the population variance", not "the population variance". The formula for computing the population variance would use the distribution of the population as its input. The population variance is not a function of the values in a sample.

Simply maintain a count of the number of objects in the set. When an object is added to the set, update the count, the running sum (for the mean), and the running sum of squares (to compute the variance).

I should agree with that because of post #2! The original poster may have lost interest in this thread, but if he is writing a computer program that may do updates hundreds of times, he should be concerned about roundoff errors, and, to me, it seems safer to keep something like the current sum and sum of squares than to do updates by methods that involve more numerical operations. However, for many data sets, the sum of squares will "overflow", so it cannot be computed simply by summing the squares. I don't know of any numerical methods that work for all orders of magnitude of data. The original poster needs to provide more details if he still needs advice.
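One standard alternative (not proposed in the thread itself) is Welford's online algorithm, which tracks the mean and the sum of squared deviations from it instead of the raw sum of squares, and is less prone to the overflow and cancellation problems mentioned above. A minimal sketch:

```python
class WelfordStats:
    """Welford's online algorithm for the running mean and variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def population_variance(self):
        return self.m2 / self.n

    def sample_variance(self):
        return self.m2 / (self.n - 1)
```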


See this:

http://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance

to get the formulae and to understand the difference.

Ugh. In its current state, that article doesn't promote understanding.
 
  • #11
roihat said:
Example:

1 2 3 4
mean:2.5
stddev:1.11803
variance:1.25

After adding a new number, my formula gives the correct answer for the standard deviation.

1 2 3 4 10
mean:4
stddev:3.16228
variance:10

Further to Stephen and rbj's helpful comments, what you've calculated here is the standard deviation and variance of two discrete random variables with known distributions, which is subtly different to the problem originally posted (where we only have sample data from unknown distributions).

Including the bias correction gives different "correct" answers for the standard deviations of your two samples (1.2910 and 3.5355 respectively).

Most modern software packages include the bias correction when they compute the standard deviation of a sample (e.g. STDEV in Excel, sd in R, std in MATLAB/Octave, StandardDeviation in Mathematica).
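In Python, for instance, both conventions are available side by side; NumPy's default is the population form, while ddof=1 and the standard library's stdev give the bias-corrected sample form:

```python
import numpy as np
from statistics import pstdev, stdev

data = [1, 2, 3, 4, 10]

print(np.std(data))          # 3.1622... population sd (NumPy's default, ddof=0)
print(np.std(data, ddof=1))  # 3.5355... bias-corrected sample sd
print(pstdev(data))          # population sd from the standard library
print(stdev(data))           # bias-corrected sample sd
```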
 
  • #12
bpet said:
what you've calculated here is the standard deviation and variance of two discrete random variables with known distributions, which is subtly different to the problem originally posted (where we only have sample data from unknown distributions).

Computing "the variance of a sample" of N things by the formula that uses division by N (instead of N-1) could be interpreted that way, but the correct interpretation of my point of view is that "variance of a sample" has some , shall we say, "arbitrary" definition - at least in a given textbook. The definition may have been chosen so it can be conveniently used in a formula for some estimators, but "variance of a sample" is a statistic in its own right.

Likewise "mean of a sample" and "standard deviation" of a sample can be defined without any statement that tells how they are to be used as estimators or interpreted as parameters of some population.

The original poster did not ask about estimating the mean and standard deviation of an unknown population. The question is about the "mean and standard deviation of a 1 dimensional set of data".

Of course, answering posts is often a matter of mind reading, and it may be correct that jaderberg's intent is to estimate population parameters. It also may be that jaderberg's definition of "standard deviation of a sample" uses division by N-1.

In fact, my old college text "Introduction To The Theory Of Statistics", 3rd Edition, by Mood, Graybill and Boes, defines (p. 229) the "Sample Variance" to be S^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X})^2 and defines the "second moment" of the sample to be \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2. However, I think (without having checked this morning) that there are other texts which define the "variance of a sample" and the "second moment" of a sample to be the same thing.
bpet said:
Most modern software packages include the bias correction when they compute the standard deviation of a sample (e.g. STDEV in Excel, sd in R, std in MATLAB/Octave, StandardDeviation in Mathematica).

The conventions enforced by software have a powerful effect, so I won't be surprised if the "standard" definition of sample variance in textbooks becomes (or already is) the one involving division by N-1.

A relevant thread is "Standard deviation in Excel" by poster ssd: https://www.physicsforums.com/showthread.php?t=371424&highlight=variance+excel
 