Sample Variance: Why Divide by n-1?

In summary, the sample variance is calculated by dividing the sum of squared differences between each data point and the sample mean by the number of data points minus one. This is done to make the estimate unbiased: it accounts for the fact that the same data are used to estimate both the mean and the variance, which removes one degree of freedom. The n-1 denominator is therefore a correction for finite n; in the asymptotic limit the n and n-1 estimators are equivalent.
  • #1
jackiefrost
If I have a sample consisting of n measurements why is the sample variance the result of dividing by n-1 instead of n?

jf
 
  • #2
jackiefrost said:
If I have a sample consisting of n measurements why is the sample variance the result of dividing by n-1 instead of n?

jf

Well, some texts/people use n, but the reason for using n-1 is to make the estimate unbiased. I.e., you want the expected value of your estimate to equal the true population variance, and this requires using n-1. I'll leave the details to you...
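
(Just so the terminology is clear: an estimator [itex]\hat{\theta}[/itex] of a parameter [itex]\theta[/itex] is called unbiased if its expectation equals the parameter,

[tex]E[\hat{\theta}] = \theta,[/tex]

for every sample size; the "bias" is simply the difference [itex]E[\hat{\theta}] - \theta[/itex].)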
 
  • #3
I've been looking at my textbook's explanation and at the Wikipedia and Wolfram web pages. I'm still confused about what they mean by "unbiased estimators" and what subtracting 1 in the denominator has to do with whatever the issue is here...

jf
 
  • #4
Well, if you use (n-1) in the denominator, then the expectation of the sample variance so defined exactly equals the population variance. That is,

[tex]E(S^2) = \sigma^2[/tex]

the expectation is thus independent of n. But if you define it instead as

[tex]S'^2 = \frac{1}{n}\sum_{i=1}^{n}(X_{i}-\bar{X})^2[/tex]

(note the sample mean [itex]\bar{X}[/itex], not the unknown population mean), then since [itex]S'^2 = \frac{n-1}{n}S^2[/itex], you have

[tex]E(S'^2) = \frac{n-1}{n}E(S^2) = \frac{n-1}{n}\sigma^2[/tex]

which goes to [itex]\sigma^2[/itex] in the limit [itex]n\rightarrow \infty[/itex] but does not equal [itex]\sigma^2[/itex] for finite values of n. That's why the sample variance defined with (n-1) in the denominator is called an unbiased estimator (of the population variance).

By the way, the (n-1) factor is a 'correction' for finite n. As you can see, in the asymptotic limit (only), both these definitions are equivalent.

The idea behind estimators is to be able to find the population variance (and mean), which in an actual statistical problem you don't know. So you construct functions (called estimators) that estimate the population statistics from the sample statistics. The closer your estimator takes you to the population value, the better it is for you. In the large-n limit, the properties of the sample approach the properties of the underlying distribution exactly.
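
A quick numerical sanity check of the two definitions (a rough sketch, assuming NumPy; the true variance below is 4 and n = 5):

[code]
# Monte Carlo sketch: compare the 1/n and 1/(n-1) estimators of the variance,
# both computed about the sample mean.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 200_000
sigma2 = 4.0                                # true population variance

samples = rng.normal(loc=10.0, scale=np.sqrt(sigma2), size=(trials, n))
xbar = samples.mean(axis=1, keepdims=True)  # sample mean of each trial
ss = ((samples - xbar) ** 2).sum(axis=1)    # sum of squared deviations

print("divide by n   :", (ss / n).mean())        # ~ (n-1)/n * 4 = 3.2 (biased)
print("divide by n-1 :", (ss / (n - 1)).mean())  # ~ 4.0 (unbiased)
[/code]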
 
  • #5
I have had the same problem understanding this issue. Frequently, textbooks and online websites gloss over the issue with a pithy and unsatisfactory statement about degrees of freedom, leaving me to wonder whether the real explanation has anything to do with degrees of freedom at all.

maverick, your explanation cleared up a few things for me, but one problem remains:

maverick280857 said:
Well, if you have (n-1) then the expectation of the so defined sample variance exactly equals the population variance.

Why is this? Or is division by n-1 just a better estimator than division by n in the finite case? If so, why?
 
  • #6
Tedjn said:
I have had the same problem understanding this issue. Frequently, textbooks and online websites gloss over the issue with a pithy and unsatisfactory statement about degrees of freedom, leaving me to wonder whether the real explanation has anything to do with degrees of freedom at all.

It is degrees of freedom. Specifically, it's because you're already using the same data to estimate the mean; if you were to know the population mean ahead of time, and were only interested in the variance on its own, then the unbiased estimator would indeed use a denominator of n. What's more common, however, is that you need to first estimate the mean, and then use that estimate in your estimate of the variance. It's this cascaded method of estimation that throws off the variance estimator, and requires the n-1 denominator. Intuitively speaking, introducing the mean estimate into the variance estimator eliminates one degree of freedom because the mean estimate (which is just the sample average), together with any n-1 of the samples, uniquely determines the remaining sample.
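
To see the lost degree of freedom concretely: the residuals about the sample mean always satisfy one linear constraint,

[tex]\sum_{i=1}^{n}(x_i - \bar{x}) = 0,[/tex]

so once [itex]\bar{x}[/itex] is fixed, any n-1 of the samples determine the last one via [itex]x_n = n\bar{x} - \sum_{i=1}^{n-1}x_i[/itex]; only n-1 of the residuals are free to vary.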

Tedjn said:
Why is this? Or is division by n-1 just a better estimator than division by n in the finite case? If so, why?

Let's crank through it:

Assume we have [itex]n[/itex] i.i.d. samples [itex]\left\{x_1,\ldots,x_n \right\}[/itex] with mean [itex]\mu[/itex] and variance [itex]\sigma^2[/itex]. First, let's consider what would happen if we knew the true mean [itex]\mu[/itex] and only wanted to estimate the variance:

[tex]E\left[ \sum_{i=1}^n (x_i - \mu)^2 \right] = \sum_{i=1}^n E\left[ (x_i - \mu)^2 \right] = n\sigma^2 [/tex].

Which is to say that we'd use an estimator with denominator n to get an unbiased estimate. So far, so good, right? Now, let's examine what happens if we don't know [itex]\mu[/itex] and instead need to estimate it, using the usual sample average estimator (which is unbiased):

[tex]E\left [ \sum_{i=1}^n\left( x_i - \frac{1}{n}\sum_{k=1}^n x_k \right)^2 \right ] = \sum_{i=1}^nE\left[ \left( x_i - \frac{1}{n}\sum_{k=1}^n x_k \right)^2 \right] [/tex]

[tex] = \sum_{i=1}^n E \left[ x_i^2 - \frac{2}{n}x_i\sum_{k=1}^nx_k + \frac{1}{n^2}\left( \sum_{k=1}^n x_k \right)^2 \right] [/tex]

[tex]= n\left( \sigma^2 + \mu^2 - \frac{2}{n}(\sigma^2 + n\mu^2) + \frac{1}{n^2}(n\sigma^2 + n^2\mu^2) \right) = (n-1)\sigma^2[/tex]

So, we see that the terms arising from the mean estimator (which is a random variable) had the net effect of subtracting [itex]\sigma^2[/itex] from the sum, requiring a denominator of (n-1) for unbiasedness. I.e., it's like you're estimating the variance with a known mean, but with only (n-1) data points.

A more explicit way to demonstrate this is to write the mean estimate in terms of the true mean: [itex]\frac{1}{n}\sum_{i=1}^n x_i = \mu + \epsilon[/itex] where [itex]E(\epsilon) = 0[/itex], [itex]Var(\epsilon) = \frac{\sigma^2}{n}[/itex] and [itex]E(x_i \epsilon) = \frac{\sigma^2}{n}\,,\,\forall i[/itex]. Then, the variance estimator looks like:

[tex]\sum_{i=1}^nE\left[ (x_i - \mu - \epsilon)^2 \right] = \sum_{i=1}^n E\left[ (x_i - \mu)^2 - 2(x_i-\mu)\epsilon + \epsilon^2 \right] = n\left( \sigma^2 - 2\frac{\sigma^2}{n} + \frac{\sigma^2}{n} \right) = (n-1)\sigma^2[/tex].

Comparing that derivation with the first one (using the true mean, that is), it should be evident that the introduction of the "error term" ([itex]\epsilon[/itex]) has had an effect equivalent to the elimination of a degree of freedom.
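
Equivalently, dividing the sum by n and using symmetry, each individual term falls short by exactly the right factor:

[tex]E\left[(x_i - \bar{x})^2\right] = \frac{n-1}{n}\,\sigma^2,[/tex]

which is precisely the shortfall that the (n-1) denominator compensates for.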
 
  • #7
Thanks quadraphonics,

Your explanation looks very detailed, and I will take my time reading through it. In particular, it will take me a little time to wrap my head around algebra with the expectation value.

I'll probably respond again in awhile if I run into any problems, but thank you very much.
 
  • #8
Good explanations maverick and Quad! That's what I wanted. It makes sense. [Also - now I understand the n-1 from the "degrees of freedom" approach that some texts mention]

Thanks again.
 
  • #9
Hi all,

First of all I apologize for reviving such an old thread.

I am currently trying to find a way to prove the same thing, but without relying on the definition of expected value; I am not sure if this is possible at all. Intuitively it seems it should be possible, but I simply cannot find a way to do it. Here is what I tried.

First of all I define [tex]\mu[/tex] as the true population mean, and
[tex]\bar{x} = \frac{1}{n}\sum^n_{i=1}x_i[/tex]
as the sample mean.

Then from that I define
[tex]\sigma^2 = \frac{1}{n}\sum^n_{i=1}(x_i - \mu)^2[/tex]
as the sample variance estimated using the true mean and
[tex]s^2_{n-1} = \frac{1}{n-1}\sum^n_{i=1}(x_i - \bar{x})^2[/tex]
as the unbiased sample variance estimated using the sample mean.

Similarly I also define
[tex]s^2_{n} = \frac{1}{n}\sum^n_{i=1}(x_i - \bar{x})^2[/tex]
as the biased sample variance.

What I could already do is prove that [tex]s^2_{n}[/tex] has a bias, and also how large that bias is. But I could not find a way to prove that [tex]s^2_{n-1}[/tex] doesn't have a bias.

Here is how I proved that [tex]s^2_{n}[/tex] has a bias:

If I take the term [tex](x_i-\bar{x})^2[/tex] and both add and subtract [tex]\mu[/tex] inside the parentheses, then the expression expands this way:
[tex](x_i-\bar{x} - \mu + \mu)^2=(x_i-\mu)^2-2\bar{x}(x_i-\mu)+2\mu(x_i-\mu)+(\bar{x}-\mu)^2[/tex]
Well, since
[tex]ns^2_{n} = \sum^n_{i=1}(x_i - \bar{x})^2[/tex]
Then from the expression before I can get the following sums:
[tex]ns^2_n = \sum^n_{i=1}(x_i-\mu)^2-2\bar{x}\sum^n_{i=1}(x_i - \mu) + 2\mu\sum^n_{i=1}(x_i - \mu)+\sum^n_{i=1}(\bar{x} - \mu)^2[/tex]
Here I can use the fact that
[tex]\sum^n_{i=1}c = (c+c+\cdots+c)=n c[/tex]
and that
[tex]\sum^n_{i=1}x_i = n \bar{x}[/tex]
and then simplify the above expression considerably, this way (sorry if I am too verbose):
[tex]ns^2_n=\sum^n_{i=1}(x_i-\mu)^2-2\bar{x}(n\bar{x}-n\mu)+2\mu(n\bar{x}-n\mu) + n(\bar{x}-\mu)^2=\sum^n_{i=1}(x_i-\mu)^2-n(\bar{x}-\mu)^2[/tex]

Therefore:
[tex]s^2_n = \sigma^2-(\bar{x}-\mu)^2[/tex]
Here the second term on the right-hand side is a non-positive bias, which is strictly negative whenever the sample mean differs from the population mean.

My real problem is the second part. I cannot find a way to show that [tex]s^2_{n-1}[/tex] is unbiased... it seems I always end up going in circles. If I try to set [tex]s^2_{n-1} = \sigma^2[/tex], then from the definitions above I end up with [tex]\sigma^2 = n(\bar{x} - \mu)^2[/tex], which simply does not make sense to me!

Does anyone have any idea how I could solve this problem?
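
(As a sanity check: the identity derived above, [itex]s^2_n = \sigma^2 - (\bar{x}-\mu)^2[/itex], with [itex]\sigma^2[/itex] as defined in this post, holds exactly for any finite data set, so it can at least be verified numerically without invoking expectations. A rough sketch, assuming NumPy:)

[code]
# Numerical check of the exact identity
#   s_n^2 = (1/n) * sum (x_i - mu)^2  -  (xbar - mu)^2
# for an arbitrary data set and a given "true" mean mu.
import numpy as np

rng = np.random.default_rng(1)
mu = 3.0                                   # treat this as the true mean
x = rng.normal(loc=mu, scale=2.0, size=50)

xbar = x.mean()
lhs = ((x - xbar) ** 2).mean()             # biased sample variance s_n^2
rhs = ((x - mu) ** 2).mean() - (xbar - mu) ** 2

print(lhs, rhs, np.isclose(lhs, rhs))      # the two agree up to rounding
[/code]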
 

What is sample variance and why is it important?

Sample variance is a statistical measure of how spread out a set of data points are from the mean. It is important because it allows us to quantify the variability in our data and make comparisons between different groups or samples.

Why do we divide by n-1 when calculating sample variance?

Dividing by n-1 instead of n is known as using Bessel's correction, and it is necessary when calculating sample variance because it provides an unbiased estimate of the population variance. Without it, the sample variance tends to underestimate the true population variance.
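
As a practical aside (an illustration assuming NumPy, not something specific to this FAQ): most statistics libraries let you choose between the two denominators explicitly. In NumPy this is the ddof ("delta degrees of freedom") argument of numpy.var:

[code]
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print(np.var(x))           # divides by n   (ddof=0, the default)  -> 4.0
print(np.var(x, ddof=1))   # divides by n-1 (Bessel's correction)  -> ~4.57
[/code]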

What is the difference between sample variance and population variance?

Sample variance is calculated using a subset of data points from a larger population, while population variance uses all data points from the entire population. Sample variance is an estimate of the population variance and is used when the entire population cannot be measured.

When should I use sample variance instead of population variance?

Sample variance should be used when the entire population cannot be measured, such as in scientific experiments or surveys. It is also useful when comparing different groups or samples within a population.

Are there any limitations to using sample variance?

One limitation of the sample variance is that, because it squares deviations, it is sensitive to outliers: a single extreme value can dominate the result. For strongly skewed or heavy-tailed data it can therefore be a misleading summary of spread, and many inference procedures built on top of it additionally assume approximate normality.
