Degree of freedom and formula for standard variation

In summary: the thread discusses what "degrees of freedom" means for the variance of a sample of n data points, and why the unbiased estimate of the population variance uses n-1 rather than n in the denominator. A "good" estimator is asymptotically consistent and unbiased; a "bad" estimator is asymptotically inconsistent and/or biased. The standard 1/N formula for the sample variance yields a biased estimate of the population variance, with bias ##-\sigma^2/N##; the expression using N-1 instead of N yields an unbiased estimate.
  • #1
Nikitin
For a set with n points of data, why is the "degree of freedom" of the standard variance n-1? Hell, what does "degree of freedom" actually mean?

Heck, my book "proves" this by saying that since ##\sum_1^n (x_i - \bar{x}) = 0## (obviously), then ##\sum_1^n (x_i - \bar{x})^2## must have n-1 independent pieces of information? Is this connection supposed to be obvious?

My gut feeling agrees that the degree of freedom is n-1, but my brain does not understand. Can somebody explain it formally?

PS: My class statistics book is "Statistics for scientists and engineers, 9th ed.". Is it crap (so far I don't like it)? Can you guys recommend something better?
 
  • #2
I never heard of the term "standard variance".

The variance of a sample of size N is ##\frac{1}{N}\sum_1^N (x_i-\bar x)^2##.

However, that is not an unbiased estimate of the variance of the population. An unbiased estimate is ##\frac{1}{N-1}\sum_1^N (x_i-\bar x)^2##.

Some writers attempt to reduce the confusion by defining the "variance of a sample" as ##\frac{1}{N-1}\sum_1^N (x_i-\bar x)^2##.

IMO defining one thing incorrectly, to make it equal to something different, increases the confusion instead of reducing it, but it's very common.

How to do it right: https://maxwell.ict.griffith.edu.au/sso/biased_variance.pdf
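
A minimal numerical sketch of the two formulas, assuming Python with NumPy (whose np.var exposes both denominators through its ddof argument; the seed and sample values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)                 # illustrative seed
x = rng.normal(loc=5.0, scale=2.0, size=10)    # small sample; true variance is 4

xbar = x.mean()
biased = np.sum((x - xbar) ** 2) / len(x)           # divide by N
unbiased = np.sum((x - xbar) ** 2) / (len(x) - 1)   # divide by N-1

# NumPy's "delta degrees of freedom" argument selects the denominator:
# ddof=0 divides by N, ddof=1 divides by N-1.
assert np.isclose(biased, np.var(x, ddof=0))
assert np.isclose(unbiased, np.var(x, ddof=1))
print(biased, unbiased)   # the N-1 version is always the larger of the two
```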
 
  • #3
Nikitin said:
For a set with n points of data, why is the "degree of freedom" of the standard variance n-1? Hell, what does "degree of freedom" actually mean?

Heck, my book "proves" this by saying that since ##\sum_1^n (x_i - \bar{x}) = 0## (obviously), then ##\sum_1^n (x_i - \bar{x})^2## must have n-1 independent pieces of information? Is this connection supposed to be obvious?
The set [itex]\{x_i-\bar x\}[/itex] certainly doesn't have n independent pieces of information. How can it? Those n pieces of information are not independent because they *must* sum to zero. That set has at most n-1 independent pieces of information. If the samples are truly independent then it does have n-1 pieces of information.
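
A minimal sketch of that constraint, assuming Python with NumPy (seed and sample size are arbitrary): once ##\bar x## is fixed, the last deviation is completely determined by the other n-1.

```python
import numpy as np

rng = np.random.default_rng(1)   # illustrative seed
x = rng.normal(size=8)
d = x - x.mean()                 # the deviations x_i - xbar

# The deviations must sum to zero...
assert np.isclose(d.sum(), 0.0)

# ...so the last deviation carries no new information: it is just
# minus the sum of the first n-1. Only n-1 of them are free to vary.
assert np.isclose(d[-1], -d[:-1].sum())
```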

The degrees of freedom argument is nonetheless a bit ad hoc. A much better reason to use n-1 as the denominator as opposed to n is that the former yields an unbiased estimate of the variance. There's a rather sophisticated set of statistical machinery for what goes into forming unbiased estimates. This is however probably beyond the scope of an introductory statistics class.
 
  • #4
Thanks for all the replies!

Aleph, I'm sorry I meant "sample variance". Why is the estimate unbalanced though? What does unbalanced mean?

DH: Why doesn't the set have "n independent pieces of information" just because the sum of x_i - x_mean = 0?

And what does "unbiased" mean?
 
  • #5
Nikitin said:
Why doesn't the set have "n independent pieces of information" just because the sum of x_i - x_mean = 0?

You have ##x_1## to ##x_n##, but also ##\overline x##. These are linearly dependent: ##x_n = n \overline x - x_1 - \ldots - x_{n-1}##.
Note that when calculating the variance, you are using ##\overline x## in addition to the ##x_i## in your equation.
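
A one-line check of that identity, assuming Python with NumPy (the data values are arbitrary):

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])   # arbitrary sample
n, xbar = len(x), x.mean()

# x_n is determined by the mean and the other n-1 points:
# x_n = n*xbar - x_1 - ... - x_{n-1}
assert np.isclose(x[-1], n * xbar - x[:-1].sum())
```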
 
  • #6
pwsnafu said:
You have ##x_1## to ##x_n##, but also ##\overline x##. These are linearly dependent: ##x_n = n \overline x - x_1 - \ldots - x_{n-1}##.
But this would be like saying ##x_n## is linearly dependent with ##x_n##, since ##x_n = n\bar{x} - x_1 - x_2 - \dots - x_{n-1} = (x_1 + x_2 + \dots + x_n) - x_1 - x_2 - \dots - x_{n-1} = x_n##. Considering the original data set had n pieces of independent information, why does 1 piece suddenly disappear?

Note that when calculating the variance, you are using ##\overline x## in addition to the ##x_i## in your equation.

When I first read the formula for the sample variance I had a gut feeling about why it was correct to divide by ##n-1##. I mean, if you were to calculate the sample variance of a set with an odd number of points by looking at the deviations from the median instead of the mean, [tex]\frac{\sum_1^n (x_i- \tilde{x})^2}{n-1},[/tex] then you would have to divide by ##n-1##, because the middle ##x_i##-value equals the median and thus gets zeroed out and doesn't contribute anything.

My intuition said that perhaps something similar happens when calculating the sample variance. Am I on the right track?

EDIT: After reading AlephZero's link, I can see that I don't have enough knowledge of statistics to understand the answers to my question. Well, maybe I'll try again in 6 months. Thanks for the help!
 
  • #7
Nikitin said:
Thanks for all the replies!

Aleph, I'm sorry I meant "sample variance". Why is the estimate unbalanced though? What does unbalanced mean?

DH: Why doesn't the set have "n independent pieces of information" just because the sum of x_i - x_mean = 0?

And what does "unbiased" mean?
AlephZero did not use the word "unbalanced". Both he and I used the word "unbiased."

Suppose you have a set of samples ##\{x_i\}## of size ##N## of a random variable ##X##, and you want to estimate some parameter ##\theta## of the underlying probability distribution. For example, ##\theta## might be the mean, or the variance, or something more esoteric such as the probability that something bad will happen because of ##X##. Finally, suppose you find some expression/algorithm/function ##f(\{x_i\})## that yields an estimate ##\vartheta## of the parameter ##\theta## from those collected samples.

A key question: Is the value ##\vartheta## generated by ##f(\{x_i\})## a "good" estimate of ##\theta##? In other words, is ##f(\{x_i\})## a "good" estimator of ##\theta##?

There are a number of ways of answering this question. One approach is to look at the behavior as the sample size ##N## grows larger. The estimate ##\vartheta## should converge to ##\theta## as ##N \to \infty##. An estimator that exhibits this behavior is called an "asymptotically consistent estimator". An estimator that isn't asymptotically consistent is close to worthless: if I have billions of samples, I should expect the estimate to be pretty close to spot on.

Oftentimes we don't have the luxury of generating billions of samples. We might well have only a small handful of samples if it takes a lot of time or money to generate one. That an estimator is asymptotically consistent is no guarantee that it yields a "good" estimate when used against a finitely sized set of measurements.
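
A sketch of asymptotic consistency, assuming Python with NumPy (seed and sample sizes are illustrative): even the biased ##1/N## estimator homes in on the true variance as ##N## grows, because its bias shrinks like ##1/N##.

```python
import numpy as np

rng = np.random.default_rng(2)   # illustrative seed
sigma2 = 4.0                     # true population variance

# Even the biased 1/N estimator is asymptotically consistent:
# its estimates approach sigma^2 as the sample size N grows.
for N in (10, 1_000, 100_000):
    x = rng.normal(scale=np.sqrt(sigma2), size=N)
    print(N, np.var(x, ddof=0))  # tends toward 4.0 as N increases
```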

Note that ##\vartheta## is itself a random variable: draw a new set of samples from ##X## and you will get a different value for ##\vartheta##. Because ##\vartheta## is a random variable, it too has a mean. The bias of the estimator ##f## is defined as ##b[\vartheta] = E[\vartheta] - \theta##. The estimator is "unbiased" if ##b[\vartheta] = 0## (in other words, if ##E[\vartheta] = \theta##).

Let's look at the uncorrected estimate of the variance, ##s'^2 = \frac{1}{N}\sum_1^N (x_i-\bar x)^2##. As shown in AlephZero's link, ##E[s'^2] = \sigma^2 - \sigma^2/N##, and thus the bias of this estimator is ##-\sigma^2/N##. This obviously is not zero: the uncorrected estimate is a biased estimate. On the other hand, the estimate that results from using ##N-1## as the denominator instead of ##N## is unbiased.
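
A Monte Carlo sketch of that bias, assuming Python with NumPy (seed, sample size, and trial count are illustrative): averaging the ##1/N## estimate over many independent samples should land near ##\sigma^2 - \sigma^2/N##, while the ##1/(N-1)## estimate averages to ##\sigma^2##.

```python
import numpy as np

rng = np.random.default_rng(3)        # illustrative seed
sigma2, N, trials = 4.0, 5, 200_000   # true variance 4, tiny samples of size 5

samples = rng.normal(scale=2.0, size=(trials, N))
biased = samples.var(axis=1, ddof=0)    # one 1/N estimate per trial
unbiased = samples.var(axis=1, ddof=1)  # one 1/(N-1) estimate per trial

print(biased.mean())    # approximately sigma^2 - sigma^2/N = 4 - 0.8 = 3.2
print(unbiased.mean())  # approximately sigma^2 = 4.0
```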
 
  • #8
Nikitin said:
But this would be like saying ##x_n## is linearly dependent with ##x_n##, since ##x_n = n\bar{x} - x_1 - x_2 - \dots - x_{n-1} = (x_1 + x_2 + \dots + x_n) - x_1 - x_2 - \dots - x_{n-1} = x_n##.

What matters is whether the set as a whole is linearly dependent. The set ##\{x_1, \ldots, x_n\}## is linearly independent. The set ##\{x_1, \ldots, x_n, \overline x\}## is linearly dependent.

Considering the original data set had n pieces of independent information, why does 1 piece suddenly disappear?

Because you are now using ##\overline x##, which is computed from the ##x_i##, in addition to the ##x_i## themselves. One of those ##n+1## quantities is redundant, so a piece of information is spent on the mean.
 
  • #9
A nice heuristic explanation that may get the juices flowing for why you divide by N-1 instead of N is that, given [itex] x_1,\ldots,x_N [/itex], [itex] \overline{x}[/itex] is actually the number that minimizes
[tex] \frac{1}{N} \sum_{i} (x_i - \overline{x})^2. [/tex]

If I told you the actual mean of the random variable was [itex] \mu[/itex] (which in general will not be equal to the sample mean), you would probably guess that the variance is
[tex] \frac{1}{N} \sum_{i} (x_i-\mu)^2[/tex]
instead, right? This number will always be at least as large as the previous guess. So we know that we are underestimating the variance when we use
[tex] \frac{1}{N} \sum_{i} (x_i - \overline{x})^2; [/tex]
the only question is how much we are underestimating it by (which then gets into the unbiasedness explanation that others have posted above).
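
That minimizing property is easy to verify numerically; a sketch assuming Python with NumPy, with the true mean ##\mu## treated as known by construction (seed and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)   # illustrative seed
mu = 5.0                         # true mean, known here by construction
x = rng.normal(loc=mu, scale=2.0, size=10)

def mean_sq_dev(c):
    """Average squared deviation of the sample around the point c."""
    return np.mean((x - c) ** 2)

# The sample mean minimizes the average squared deviation, so plugging
# in the true mean mu (or any other point) can only give a larger value.
assert mean_sq_dev(x.mean()) <= mean_sq_dev(mu)
print(mean_sq_dev(x.mean()), mean_sq_dev(mu))
```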
 
  • #10
Think about this: I tell you I'm thinking of six numbers, and that the sum of all six is zero. Then I say: the first five are 10, 20, 30, 40, 50.
Could you give me the final number? Yes: because you know the sum is zero, that final number has to be -150. Here, even though I have six numbers in mind, it is enough to know five of them, since the restriction that the sum equals zero lets you determine the remaining one.

First generalization: for any set of n numbers, if the sum of those numbers is zero, only n-1 of them need to be specified, since the remaining one is then determined.

Now, for your question, the numbers are
[tex]
x_1 - \overline x, x_2 - \overline x, x_3 - \overline x, \dots, x_n - \overline x
[/tex]

These add to zero, so even though there are n of them, only n-1 need to be specified: the quantity n-1 is the number of degrees of freedom of the set.

The problem with this approach: it is really, really not statistical, and it doesn't generalize to other settings (at least not easily).
 

Related to Degree of freedom and formula for standard variation

1. What is the concept of degree of freedom?

The degree of freedom is a statistical concept that refers to the number of independent pieces of information that are available to estimate a population parameter. In other words, it is the number of values that are free to vary in a calculation or statistical analysis.

2. How is degree of freedom calculated?

The formula for calculating the degrees of freedom depends on the type of statistical test or analysis being performed. In general, it is equal to the sample size minus the number of parameters being estimated. For example, in a t-test comparing two means, the degrees of freedom equal the combined sample size of the two groups minus two (e.g., two groups of 10 observations give 10 + 10 - 2 = 18 degrees of freedom).

3. What is the formula for standard deviation?

The standard deviation is the square root of the average squared deviation from the mean. For a population it is denoted σ (sigma) and computed as σ = √(Σ(x-μ)²/n); for a sample, it is denoted s and the sum is divided by n-1 instead of n.
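
A sketch of both forms, assuming Python with NumPy, whose np.std selects the denominator via its ddof argument (the sample values are arbitrary):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # arbitrary sample

# Population form: square root of the mean squared deviation (divide by n).
sigma = np.sqrt(np.sum((x - x.mean()) ** 2) / len(x))
assert np.isclose(sigma, np.std(x, ddof=0))   # equals 2.0 for this data

# Sample form: divide by n-1 instead of n.
s = np.sqrt(np.sum((x - x.mean()) ** 2) / (len(x) - 1))
assert np.isclose(s, np.std(x, ddof=1))
```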

4. How is the standard deviation used in data analysis?

The standard deviation measures the spread or variability of a dataset, that is, how much the data points deviate from the mean. It is commonly used in inferential statistics to assess the reliability and significance of results, as well as to make comparisons between groups.

5. What are the limitations of using standard deviation?

While the standard deviation is a useful measure of variability, it has some limitations. Its common interpretations (such as the 68-95-99.7 rule) assume the data are approximately normally distributed, and it can be strongly affected by outliers. Additionally, it does not capture the shape of the distribution or any underlying patterns in the data. Other measures, such as the interquartile range or the coefficient of variation, may be more appropriate in certain situations.
