Degree of freedom and formula for standard variation

Nikitin · Jan 9, 2014

For a set with n points of data, why is the "degree of freedom" of the standard variance n-1? Hell, what does "degree of freedom" actually mean?

Heck, my book "proves" this by saying that since ##\sum_1^n (x_i - \bar{x}) = 0## (obviously), then ##\sum_1^n (x_i - \bar{x})^2## must have n-1 independent pieces of information? Is this connection supposed to be obvious?

My gut feeling agrees that the degree of freedom is n-1, but my brain does not understand. Can somebody explain it formally?

PS: My class statistics book is "Statistics for scientists and engineers, 9th ED". Is it crap (so far I don't like it)? Can you guys recommend something better?

AlephZero · Jan 9, 2014

I never heard of the term "standard variance".

The variance of a sample of size N is ##\frac{1}{N}\sum_1^N (x_i-\bar x)^2##.

However that is not an unbiased estimate of the variance of the population. An unbiased estimate is ##\frac{1}{N-1}\sum_1^N (x_i-\bar x)^2##.

Some writers attempt to reduce the confusion by defining the "variance of a sample" as ##\frac{1}{N-1}\sum_1^N (x_i-\bar x)^2##.

IMO defining one thing incorrectly, to make it equal to something different, increases the confusion instead of reducing it, but it's very common.

How to do it right: https://maxwell.ict.griffith.edu.au/sso/biased_variance.pdf
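To make the two formulas above concrete, here is a minimal Python sketch that evaluates both on a small made-up sample. The data values and the helper name `variance_with_denominator` are illustrative assumptions, not something from the thread. The standard library's `statistics.pvariance` (divide by N) and `statistics.variance` (divide by N-1) are printed alongside as a cross-check.

```python
import statistics


def variance_with_denominator(data, denominator):
    """Sum of squared deviations from the sample mean, divided by `denominator`."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / denominator


# Made-up illustrative sample (an assumption for this sketch), N = 8.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

divide_by_n = variance_with_denominator(data, n)              # the 1/N form
divide_by_n_minus_1 = variance_with_denominator(data, n - 1)  # the 1/(N-1) form

# Cross-check against the standard library's two variance functions.
print(divide_by_n, statistics.pvariance(data))
print(divide_by_n_minus_1, statistics.variance(data))
```

The two results differ only by the factor N/(N-1), so the gap matters for small samples and fades as N grows.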

D H · Jan 9, 2014

Nikitin said:

For a set with n points of data, why is the "degree of freedom" of the standard variance n-1? Hell, what does "degree of freedom" actually mean?

Heck, my book "proves" this by saying that since ##\sum_1^n (x_i - \bar{x}) = 0## (obviously), then ##\sum_1^n (x_i - \bar{x})^2## must have n-1 independent pieces of information? Is this connection supposed to be obvious?

The set [itex]\{x_i-\bar x\}[/itex] certainly doesn't have n independent pieces of information. How can it? Those n pieces of information are not independent because they *must* sum to zero. That set has at most n-1 independent pieces of information. If the samples are truly independent then it does have n-1 pieces of information.

The degrees of freedom argument is nonetheless a bit ad hoc. A much better reason to use n-1 as the denominator as opposed to n is that the former yields an unbiased estimate of the variance. There's a rather sophisticated set of statistical machinery for what goes into forming unbiased estimates. This is however probably beyond the scope of an introductory statistics class.

Nikitin · Jan 10, 2014

Thanks for all replies!

Aleph, I'm sorry I meant "sample variance". Why is the estimate unbalanced though? What does unbalanced mean?

DH: Why doesn't the set have "n independent pieces of information" just because the sum of x_i- x_mean =0?

And what does "unbiased" mean?

pwsnafu · Jan 10, 2014

Nikitin said:

Why doesn't the set have "n independent pieces of information" just because the sum of x_i- x_mean =0?

You have ##x_1## to ##x_n##, but also ##\overline x##. These are linearly dependent: ##x_n = n \overline x - x_1 - \ldots - x_{n-1}##
Note that when calculating variance, you are using ##\overline x## in addition to the ##x_i## in your equation.

Nikitin · Jan 10, 2014

pwsnafu said:

You have ##x_1## to ##x_n##, but also ##\overline x##. These are linearly dependent: ##x_n = n \overline x - x_1 - \ldots - x_{n-1}##

But this would be like saying ##x_n## is linearly dependent with ##x_n## since ##x_n = n \bar{x} - x_1 - x_2 - \ldots - x_{n-1} = (x_1 + x_2 + \ldots + x_n) - x_1 - x_2 - \ldots - x_{n-1} = x_n##. Considering the original data-set had n pieces of independent information, why does 1 piece suddenly disappear?

Note that when calculating variance, you are using ##\overline x## in addition to the ##x_i## in your equation.

When I first read the formula for sample variance I had a gut feeling about why it was correct to divide by ##n-1##. I mean, if you were to calculate the sample variance of a set with an odd number of points by looking at the deviation from the median instead of the mean, [tex]\frac{\sum_1^n (x_i- \tilde{x})^2}{n-1}[/tex] then you would have to divide by ##n-1## because the middle ##x_i##-value equals the median and thus gets zeroed out and doesn't contribute anything.

My intuition said that something similar perhaps happens when calculating sample variance? Am I on the correct track?

EDIT: After reading AlephZero's link I can see that I don't have enough knowledge of statistics to understand the answers to my question. Well, maybe I'll try again in 6 months. Thanks for the help

D H · Jan 10, 2014

Nikitin said:

Thanks for all replies!

Aleph, I'm sorry I meant "sample variance". Why is the estimate unbalanced though? What does unbalanced mean?

DH: Why doesn't the set have "n independent pieces of information" just because the sum of x_i- x_mean =0?

And what does "unbiased" mean?

AlephZero did not use the word "unbalanced". Both he and I used the word "unbiased."

Suppose you have a set of samples {x_i} of size N of a random variable X and you want to estimate some parameter θ of the underlying probability distribution. For example, θ might be the mean, or the variance, or something more esoteric such as the probability that something bad will happen because of X. Finally, suppose you find some expression/algorithm/function f({x_i}) that yields an estimate ϑ of the parameter θ from those collected samples.

A key question: Is the value ϑ generated by f({x_i}) a "good" estimate of θ? In other words, is f({x_i}) a "good" estimator of θ?

There are a number of ways of answering this question. One approach is to look at the behavior as the sample size N grows larger. The estimate ϑ should converge to θ as N→∞. An estimator that exhibits this behavior is called an "asymptotically consistent estimator". An estimator that isn't asymptotically consistent is close to worthless. For example, if I have billions of samples, I should be able to expect that the estimate is pretty close to spot on.

Oftentimes we don't have the luxury of generating billions of samples. We might well only have a small handful of samples if it takes a lot of time or money to generate one sample. That an estimator is asymptotically consistent is no guarantee that it yields a "good" estimate when used against a finitely sized set of measurements.
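A rough way to see both points (convergence for huge N, no guarantee for small N) is to apply the uncorrected variance estimator to samples of increasing size. This is only a sketch under assumptions that are not in the thread: normally distributed data, a true standard deviation of 2 (so σ² = 4), a fixed seed, and the helper name `uncorrected_variance`.

```python
import random


def uncorrected_variance(sample):
    """(1/N) * sum of squared deviations from the sample mean."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / len(sample)


rng = random.Random(0)  # fixed seed so the run is repeatable
TRUE_SIGMA = 2.0        # assumed population standard deviation, so sigma^2 = 4

# One estimate per sample size; each estimate is itself a random draw.
estimates = {}
for size in (5, 50, 5000, 500000):
    sample = [rng.gauss(0.0, TRUE_SIGMA) for _ in range(size)]
    estimates[size] = uncorrected_variance(sample)
    print(size, estimates[size])
```

The estimate from the largest sample should sit close to 4, while the estimates from the small samples can land far from it, which is why behavior at finite N needs its own criterion.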

Note that ϑ is itself a random variable. Draw a new set of samples from X and you will get a different value for ϑ. Because ϑ is a random variable, it too has a mean. The bias of the estimator f is defined as b[ϑ]=E[ϑ]-θ. The estimator is "unbiased" if b[ϑ]=0 (in other words, if E[ϑ]=θ).

Let's look at the uncorrected estimate of the variance, ##s'^2 = \frac{1}{N}\sum_1^N (x_i-\bar x)^2##. As shown in AlephZero's link, ##E[s'^2] = \sigma^2 - \sigma^2/N##, and thus the bias of this estimator is ##-\sigma^2/N##. This obviously is not zero. The uncorrected estimate is a biased estimate. On the other hand, the estimate that results from using N-1 as the denominator as opposed to N is an unbiased estimate.
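The bias can also be checked numerically by averaging each estimator over many repeated small samples. The following is a sketch, not a proof, and rests on assumptions not in the thread: normal data with σ² = 4, N = 5, 200,000 repetitions, a fixed seed, and the helper name `sum_sq_dev`.

```python
import random


def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample)


rng = random.Random(1)  # fixed seed so the run is repeatable
N = 5                   # assumed sample size
SIGMA2 = 4.0            # assumed true variance
TRIALS = 200_000        # number of independent samples drawn

total = 0.0
for _ in range(TRIALS):
    sample = [rng.gauss(0.0, SIGMA2 ** 0.5) for _ in range(N)]
    total += sum_sq_dev(sample)

mean_uncorrected = total / TRIALS / N      # average of the 1/N estimates
mean_corrected = total / TRIALS / (N - 1)  # average of the 1/(N-1) estimates

print(mean_uncorrected, "vs predicted", SIGMA2 - SIGMA2 / N)
print(mean_corrected, "vs true", SIGMA2)
```

With these settings the average of the 1/N estimates should come out near σ² - σ²/N = 3.2 and the average of the 1/(N-1) estimates near 4, up to simulation noise.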

pwsnafu · Jan 10, 2014

Nikitin said:

But this would be like saying ##x_n## is linearly dependent with ##x_n## since ##x_n = n \bar{x} - x_1 - x_2 - \ldots - x_{n-1} = (x_1 + x_2 + \ldots + x_n) - x_1 - x_2 - \ldots - x_{n-1} = x_n##.

It is whether you are linearly dependent with everyone else that is important. The set ##\{x_1, \ldots, x_n\}## is linearly independent. The set ##\{x_1, \ldots, x_n, \overline x\}## is linearly dependent.

Considering the original data-set had n pieces of independent information, why does 1 piece suddenly disappear?

Because you are now using ##x_i## and ##\overline x##.
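One way to make "linearly dependent" concrete is to write each quantity as a row of coefficients on ##x_1, \ldots, x_n## and compute matrix ranks. The sketch below does this exactly with fractions for n = 5. The choice of n, the `rank` helper and the variable names are assumptions made for illustration only.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank_count = 0
    for col in range(len(m[0])):
        # Find a row at or below the current pivot position with a nonzero entry.
        pivot = next((r for r in range(rank_count, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank_count], m[pivot] = m[pivot], m[rank_count]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != rank_count and m[r][col] != 0:
                factor = m[r][col] / m[rank_count][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank_count])]
        rank_count += 1
    return rank_count


n = 5
# Row i of the identity holds the coefficients of x_i itself.
identity = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
# Coefficients of the sample mean: 1/n on every x_j.
mean_row = [Fraction(1, n)] * n
# Row i holds the coefficients of the deviation x_i - x-bar.
centering = [[identity[i][j] - mean_row[j] for j in range(n)] for i in range(n)]

print(rank(identity))               # x_1..x_n alone
print(rank(identity + [mean_row]))  # x_1..x_n together with x-bar
print(rank(centering))              # the n deviations x_i - x-bar
```

For n = 5 the ranks are 5, 5 and 4: adding ##\overline x## gives six quantities that still span only five dimensions, and the five deviations span only four, which is the n-1 in question.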

Office_Shredder · Jan 10, 2014

A nice heuristic explanation that may get the juices flowing into why you divide by N-1 instead of N is that given [itex]x_1,...,x_N[/itex], [itex]\overline{x}[/itex] is actually the value of [itex]c[/itex] that minimizes
[tex]\frac{1}{N} \sum_{i} (x_i - c)^2.[/tex]

If I told you the actual mean of the random variable was [itex]\mu[/itex] (which in general will not be equal to the sample mean), you would probably guess that the variance is
[tex]\frac{1}{N} \sum_{i} (x_i-\mu)^2[/tex]
instead, right? This number is always at least as large as the previous guess, with equality only when the sample mean happens to equal [itex]\mu[/itex]. So we know that we tend to underestimate the variance when we use
[tex]\frac{1}{N} \sum_{i} (x_i - \overline{x})^2,[/tex]
the only question is how much we are underestimating it by (which then gets into the unbiased explanation that others have posted above).
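The size of the underestimate can be worked out in a few lines. This is a sketch that assumes the samples are independent with common mean [itex]\mu[/itex] and variance [itex]\sigma^2[/itex]. Write [itex]x_i - \mu = (x_i - \overline{x}) + (\overline{x} - \mu)[/itex], square, and sum. The cross term vanishes because [itex]\sum_i (x_i - \overline{x}) = 0[/itex], leaving
[tex]\frac{1}{N} \sum_{i} (x_i - \mu)^2 = \frac{1}{N} \sum_{i} (x_i - \overline{x})^2 + (\overline{x} - \mu)^2.[/tex]
So the two guesses differ by exactly [itex](\overline{x} - \mu)^2[/itex], which is never negative. Taking expectations, the left side has expectation [itex]\sigma^2[/itex], and under the independence assumption [itex]E[(\overline{x} - \mu)^2] = \sigma^2/N[/itex], so
[tex]E\left[\frac{1}{N} \sum_{i} (x_i - \overline{x})^2\right] = \sigma^2 - \frac{\sigma^2}{N} = \frac{N-1}{N}\,\sigma^2.[/tex]
Multiplying by [itex]N/(N-1)[/itex] removes the shortfall, which is where the N-1 denominator comes from.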

statdad · Jan 10, 2014

Think about this: I tell you I'm thinking of six numbers, and that the sum of all six is zero. Then I say: the first five are 10, 20, 30, 40, 50.
Could you give me the final number? Yes: because you know the sum is zero, that final number has to be -150. Here, even though I have six numbers in mind, it is enough to know 5 of them since the restriction that the sum equals zero will let you determine the remaining one.

First generalization: For any set of n numbers, if the sum of those numbers is zero, only n - 1 of them need to be specified since the remaining one is then determined.

Now, for your question, the numbers are
[tex] x_1 - \overline x, x_2 - \overline x, x_3 - \overline x, \dots, x_n - \overline x[/tex]

These add to zero, so even though n of them exist only n - 1 must be specified: the quantity n - 1 is the number of degrees of freedom for the set.
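The same bookkeeping can be written as a few lines of Python. This is a sketch in which the helper name `missing_value` and the second data set are made up for illustration. It recovers the sixth number from the example above, then recovers the last deviation of a sample from the other n - 1 deviations.

```python
def missing_value(known_values, required_total=0):
    """Value that makes the full list sum to `required_total`."""
    return required_total - sum(known_values)


# The six-number example: five values are given and the total must be zero.
sixth_number = missing_value([10, 20, 30, 40, 50])
print(sixth_number)

# The same idea for deviations from the sample mean (made-up data).
data = [3.0, 7.0, 8.0, 12.0, 20.0]
mean = sum(data) / len(data)
deviations = [x - mean for x in data]

# Hide the last deviation and rebuild it from the other n - 1.
recovered_last = missing_value(deviations[:-1])
print(recovered_last, deviations[-1])
```

The first call gives -150 as in the example, and the rebuilt deviation matches the hidden one, so only n - 1 of the deviations carry information that is not already implied by the rest.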

The problem with this approach: it is really, really, not-statistical, and it doesn't generalize to other settings (at least not easily).
