# Degree of freedom and formula for standard variation

1. Jan 9, 2014

### Nikitin

For a set with n points of data, why is the "degree of freedom" of the standard variance n-1? Hell, what does "degree of freedom" actually mean?

Heck, my book "proves" this by saying that since $\sum_1^n (x_i - \bar{x}) = 0$ (obviously), then $\sum_1^n (x_i - \bar{x})^2$ must have n-1 independent pieces of information. Is this connection supposed to be obvious?

My gut feeling agrees that the degree of freedom is n-1, but my brain does not understand. Can somebody explain it formally?

PS: My class statistics book is "Statistics for scientists and engineers, 9th ED". Is it crap (so far I don't like it)? Can you guys recommend something better?

2. Jan 9, 2014

### AlephZero

I never heard of the term "standard variance".

The variance of a sample of size N is $\frac{1}{N}\sum_1^N (x_i-\bar x)^2$.

However that is not an unbiased estimate of the variance of the population. An unbiased estimate is $\frac{1}{N-1}\sum_1^N (x_i-\bar x)^2$.

Some writers attempt to reduce the confusion by defining the "variance of a sample" as $\frac{1}{N-1}\sum_1^N (x_i-\bar x)^2$.

IMO defining one thing incorrectly, to make it equal to something different, increases the confusion instead of reducing it, but it's very common.

How to do it right: https://maxwell.ict.griffith.edu.au/sso/biased_variance.pdf
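This can also be checked numerically. Below is a minimal Python sketch (my own illustration, not from the link; the seed, sample size, and trial count are arbitrary) that repeatedly draws small samples from a standard normal population, whose true variance is 1, and averages both estimators:

```python
import random

random.seed(0)

N = 5             # small sample size, where the bias is most visible
TRIALS = 200_000  # number of repeated experiments
# Population is standard normal, so the true variance is 1.0.

biased_total = 0.0
unbiased_total = 0.0
for _ in range(TRIALS):
    sample = [random.gauss(0.0, 1.0) for _ in range(N)]
    xbar = sum(sample) / N
    ss = sum((x - xbar) ** 2 for x in sample)
    biased_total += ss / N          # divide by N   -> biased estimator
    unbiased_total += ss / (N - 1)  # divide by N-1 -> unbiased estimator

biased_mean = biased_total / TRIALS
unbiased_mean = unbiased_total / TRIALS
```

With N = 5, the 1/N estimator averages out near (N-1)/N = 0.8, while the 1/(N-1) estimator averages out near the true value 1.0.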

3. Jan 9, 2014

### D H

Staff Emeritus
The set $\{x_i-\bar x\}$ certainly doesn't have n independent pieces of information. How can it? Those n pieces of information are not independent because they *must* sum to zero. That set has at most n-1 independent pieces of information. If the samples are truly independent then it does have n-1 pieces of information.

The degrees of freedom argument is nonetheless a bit ad hoc. A much better reason to use n-1 as the denominator as opposed to n is that the former yields an unbiased estimate of the variance. There's a rather sophisticated set of statistical machinery for what goes into forming unbiased estimates. This is however probably beyond the scope of an introductory statistics class.

4. Jan 10, 2014

### Nikitin

Thanks for all replies!

Aleph, I'm sorry I meant "sample variance". Why is the estimate unbalanced though? What does unbalanced mean?

DH: Why doesn't the set have "n independent pieces of information" just because $\sum (x_i - \bar{x}) = 0$?

And what does "unbiased" mean?

5. Jan 10, 2014

### pwsnafu

You have $x_1$ to $x_n$, but also $\overline x$. These are linearly dependent: $x_n = n \overline x - x_1 - \ldots - x_{n-1}$.
Note that when calculating the variance, you are using $\overline x$ in addition to the $x_i$ in your equation.
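A tiny numeric sketch of that dependence (made-up data, Python): the residuals $x_i - \overline x$ always sum to zero, so knowing n-1 of them determines the last one.

```python
# Made-up data set; any numbers work.
data = [2.0, 4.0, 9.0, 5.0]
xbar = sum(data) / len(data)             # sample mean
residuals = [x - xbar for x in data]     # these must sum to zero

total = sum(residuals)                   # zero, up to rounding
last_from_others = -sum(residuals[:-1])  # the last residual is forced
```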

6. Jan 10, 2014

### Nikitin

But this would be like saying $x_n$ is linearly dependent with $x_n$, since $x_n=n \bar{x}-x_1-x_2-\ldots-x_{n-1}=(x_1+x_2+\ldots+x_{n})-x_1-x_2-\ldots-x_{n-1} = x_n$. Considering the original data-set had n pieces of independent information, why does one piece suddenly disappear?

When I first read the formula for the sample variance I had a gut feeling about why it was correct to divide by $n-1$. I mean, if you were to calculate the sample variance of a set with an odd number of points by looking at the deviation from the median instead of the mean, $$\frac{\sum_1^n (x_i- \tilde{x})^2}{n-1}$$ then you would have to divide by $n-1$, because the middle $x_i$-value equals the median and thus gets zeroed out and doesn't contribute anything.

My intuition said that something similar perhaps happens when calculating the sample variance? Am I on the correct track?

EDIT: After reading AlephZero's link I can see that I don't have enough knowledge of statistics to understand the answers to my question. Well, maybe I'll try again in 6 months. Thanks for the help!

Last edited: Jan 10, 2014
7. Jan 10, 2014

### D H

Staff Emeritus
AlephZero did not use the word "unbalanced". Both he and I used the word "unbiased."

Suppose you have a set of samples {xi} of size N of a random variable X and you want to estimate some parameter θ of the underlying probability distribution. For example, θ might be the mean, or the variance, or something more esoteric such as the probability that something bad will happen because of X. Finally, suppose you find some expression/algorithm/function f({xi}) that yields an estimate ϑ of the parameter θ from those collected samples.

A key question: Is the value ϑ generated by f({xi}) a "good" estimate of θ? In other words, is f({xi}) a "good" estimator of θ?

There are a number of ways of answering this question. One approach is to look at the behavior as the sample size N grows larger. The estimate ϑ should converge to θ as N→∞. An estimator that exhibits this behavior is called an "asymptotically consistent estimator". An estimator that isn't asymptotically consistent is close to worthless. For example, if I have billions of samples, I should be able to expect that the estimate is pretty close to spot on.

Oftentimes we don't have the luxury of generating billions of samples. We might well only have a small handful of samples if it takes a lot of time or money to generate one sample. That an estimator is asymptotically consistent is no guarantee that it yields a "good" estimate when used against a finitely sized set of measurements.

Note that ϑ is itself a random variable. Draw a new set of samples from X and you will get a different value for ϑ. Because ϑ is a random variable, it too has a mean. The bias of the estimator f is defined as b[ϑ]=E[ϑ]-θ. The estimator is "unbiased" if b[ϑ]=0 (in other words, if E[ϑ]=θ).

Let's look at the uncorrected estimate of the variance, $s'^2 = \frac{1}{N}\sum_1^N (x_i-\bar x)^2$. As shown in AlephZero's link, $E[s'^2] = \sigma^2 - \sigma^2/N$, and thus the bias of this estimator is $-\sigma^2/N$. This obviously is not zero. The uncorrected estimate is a biased estimate. On the other hand, the estimate that results from using $N-1$ as the denominator as opposed to $N$ is an unbiased estimate.
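One concrete way to see this (my own sketch, not from the thread): for a small discrete population you can enumerate every possible sample and compute the expectation of each estimator exactly, with no Monte Carlo error. The fair-coin population below is made up for illustration.

```python
import itertools

# Made-up population: a fair coin taking values 0 and 1.
# Its population variance is sigma^2 = 1/4.
values = [0.0, 1.0]
sigma2 = 0.25
N = 3
prob = 1.0 / len(values) ** N  # every length-N sample is equally likely

e_biased = 0.0
e_unbiased = 0.0
for sample in itertools.product(values, repeat=N):
    xbar = sum(sample) / N
    ss = sum((x - xbar) ** 2 for x in sample)
    e_biased += prob * ss / N          # divide by N
    e_unbiased += prob * ss / (N - 1)  # divide by N-1

# e_biased   == sigma^2 - sigma^2/N == 1/6
# e_unbiased == sigma^2            == 1/4
```

The enumeration reproduces the formula exactly: dividing by N gives $\sigma^2 - \sigma^2/N$, while dividing by N-1 gives $\sigma^2$ on the nose.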

8. Jan 10, 2014

### pwsnafu

It is whether the set as a whole is linearly dependent that is important. The set $\{x_1, \ldots, x_n\}$ is linearly independent. The set $\{x_1, \ldots, x_n, \overline x\}$ is linearly dependent.

Because you are now using $x_i$ and $\overline x$.

Last edited: Jan 10, 2014
9. Jan 10, 2014

### Office_Shredder

Staff Emeritus
A nice heuristic explanation that may get the juices flowing on why you divide by n-1 instead of n: given $x_1,...,x_n$, $\overline{x}$ is actually the number that minimizes
$$\frac{1}{N} \sum_{i} (x_i - \overline{x})^2.$$

If I told you the actual mean of the random variable was $\mu$ (which in general will not be equal to the sample mean), you would probably guess that the variance is
$$\frac{1}{N} \sum_{i} (x_i-\mu)^2$$
instead, right? This number will always be at least as large as the previous guess (and strictly larger unless $\mu$ happens to equal $\overline{x}$). So we know that we are underestimating the variance when we use
$$\frac{1}{N} \sum_{i} (x_i - \overline{x})^2;$$
the only question is how much we are underestimating it by (which then gets into the unbiased explanation that others have posted above).
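A small Python sketch of this minimization claim (the data points and the trial centers are made up): the mean squared deviation, evaluated at the sample mean, is never beaten by any other center.

```python
# Made-up sample.
data = [1.0, 4.0, 6.0, 9.0]
n = len(data)
xbar = sum(data) / n  # 5.0

def mean_sq_dev(center):
    """Average squared deviation of the data from a chosen center."""
    return sum((x - center) ** 2 for x in data) / n

at_mean = mean_sq_dev(xbar)  # 8.5
# Any center other than xbar gives a strictly larger value:
at_other_centers = [mean_sq_dev(m) for m in (0.0, 4.5, 5.5, 10.0)]
```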

Last edited: Jan 10, 2014
10. Jan 10, 2014

Think about this: I tell you I'm thinking of six numbers, and that the sum of all six is zero. Then I say: the first five are 10, 20, 30, 40, 50.
Could you give me the final number? Yes: because you know the sum is zero, the final number has to be -150. Here, even though I have six numbers in mind, it is enough to know five of them, since the restriction that the sum equals zero lets you determine the remaining one.

First generalization: For any set of n numbers, if the sum of those numbers is zero, only n-1 of them need to be specified since the remaining one is then determined.

Now, for your question, the numbers are
$$x_1 - \overline x, x_2 - \overline x, x_3 - \overline x, \dots, x_n - \overline x$$

These add to zero, so even though there are n of them, only n-1 need to be specified: the quantity n-1 is the number of degrees of freedom of the set.
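The six-numbers example above, as a two-line Python sketch:

```python
# Five freely chosen numbers plus the zero-sum constraint pin down the sixth.
known = [10, 20, 30, 40, 50]
last = -sum(known)     # forced by the constraint: -150
full = known + [last]  # all six numbers, but only five were free to choose
```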

The problem with this approach: it is really, really not statistical, and it doesn't generalize to other settings (at least not easily).