
Degree of freedom and formula for standard variation

  1. Jan 9, 2014 #1
    For a set with n points of data, why is the "degree of freedom" of the standard variance n-1? Hell, what does "degree of freedom" actually mean?

    Heck, my book "proves" this by saying that since ##\sum_1^n (x_i - \bar{x}) = 0## (obviously), then ##\sum_1^n (x_i - \bar{x})^2## must have n-1 independent pieces of information? Is this connection supposed to be obvious?

    My gut feeling agrees that the degree of freedom is n-1, but my brain does not understand. Can somebody explain it formally?

PS: My class statistics book is "Statistics for scientists and engineers, 9th ED". Is it crap (so far I don't like it)? Can you guys recommend something better?
  3. Jan 9, 2014 #2


    Science Advisor
    Homework Helper

    I never heard of the term "standard variance".

    The variance of a sample of size N is ##\frac{1}{N}\sum_1^N (x_i-\bar x)^2##.

    However that is not an unbiased estimate of the variance of the population. An unbiased estimate is ##\frac{1}{N-1}\sum_1^N (x_i-\bar x)^2##.

    Some writers attempt to reduce the confusion by defining the "variance of a sample" as ##\frac{1}{N-1}\sum_1^N (x_i-\bar x)^2##.

    IMO defining one thing incorrectly, to make it equal to something different, increases the confusion instead of reducing it, but it's very common.

    How to do it right: https://maxwell.ict.griffith.edu.au/sso/biased_variance.pdf
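The two denominators are easy to compare side by side. A minimal Python sketch (the data set is made up purely for illustration; NumPy's `ddof` argument selects the denominator ##N - \text{ddof}##):

```python
import numpy as np

# Hypothetical data set, just for illustration
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
N = len(x)
xbar = x.mean()

# Denominator N: the plain average of squared deviations
var_n = np.sum((x - xbar) ** 2) / N

# Denominator N-1: the unbiased estimate of the population variance
var_n1 = np.sum((x - xbar) ** 2) / (N - 1)

# NumPy exposes both through its ddof ("delta degrees of freedom") argument
assert np.isclose(var_n, np.var(x))            # ddof=0 is the default
assert np.isclose(var_n1, np.var(x, ddof=1))
```

Note that the two estimates differ by the factor ##N/(N-1)##, which matters for small samples but becomes negligible as ##N## grows.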
  4. Jan 9, 2014 #3

    D H

    Staff Emeritus
    Science Advisor

The set [itex]\{x_i-\bar x\}[/itex] certainly doesn't have n independent pieces of information. How can it? Those n pieces of information are not independent because they *must* sum to zero. That set has at most n-1 independent pieces of information. If the samples are truly independent, then it does have n-1 pieces of information.

    The degrees of freedom argument is nonetheless a bit ad hoc. A much better reason to use n-1 as the denominator as opposed to n is that the former yields an unbiased estimate of the variance. There's a rather sophisticated set of statistical machinery for what goes into forming unbiased estimates. This is however probably beyond the scope of an introductory statistics class.
  5. Jan 10, 2014 #4
    Thanks for all replies!

    Aleph, I'm sorry I meant "sample variance". Why is the estimate unbalanced though? What does unbalanced mean?

DH: Why doesn't the set have "n independent pieces of information" just because ##\sum_1^n (x_i - \bar{x}) = 0##?

    And what does "unbiased" mean?
  6. Jan 10, 2014 #5


    Science Advisor

You have ##x_1## to ##x_n##, but also ##\overline x##. These are linearly dependent: ##x_n = n \overline x - x_1 - \ldots - x_{n-1}##.
Note that when calculating the variance, you are using ##\overline x## in addition to the ##x_i## in your equation.
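That dependence is easy to verify numerically. A small sketch with made-up data:

```python
import numpy as np

# Made-up sample, purely to illustrate the dependence
x = np.array([1.0, 3.0, 6.0, 10.0])
n = len(x)
xbar = x.mean()
residuals = x - xbar

# The deviations from the mean always sum to zero...
assert np.isclose(residuals.sum(), 0.0)

# ...so the last deviation is fixed by the other n-1: it adds no new information
assert np.isclose(residuals[-1], -residuals[:-1].sum())

# Equivalently, x_n is recoverable from xbar and x_1, ..., x_{n-1}
assert np.isclose(x[-1], n * xbar - x[:-1].sum())
```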
  7. Jan 10, 2014 #6
But this would be like saying ##x_n## is linearly dependent on ##x_n##, since ##x_n=n \bar{x}-x_1-x_2-\ldots-x_{n-1}=(x_1+x_2+\ldots+x_{n})-x_1-x_2-\ldots-x_{n-1} = x_n##. Considering that the original data set had n pieces of independent information, why does one piece suddenly disappear?

When I first read the formula for the sample variance I had a gut feeling about why it was correct to divide by ##n-1##. I mean, if you were to calculate the sample variance of an odd-sized set by looking at the deviations from the median instead of the mean, [tex]\frac{\sum_1^n (x_i- \tilde{x})^2}{n-1}[/tex] then you would have to divide by ##n-1##, because the middle ##x_i##-value equals the median and thus gets zeroed out and doesn't contribute anything.

My intuition said that something similar perhaps happens when calculating the sample variance. Am I on the right track?

EDIT: After reading AlephZero's link I can see that I don't have enough knowledge of statistics yet to understand the answers to my question. Well, maybe I'll try again in 6 months. Thanks for the help!
    Last edited: Jan 10, 2014
  8. Jan 10, 2014 #7

    D H

    Staff Emeritus
    Science Advisor

    AlephZero did not use the word "unbalanced". Both he and I used the word "unbiased."

Suppose you have a set of samples {xi} of size N of a random variable X and you want to estimate some parameter θ of the underlying probability distribution. For example, θ might be the mean, or the variance, or something more esoteric such as the probability that something bad will happen because of X. Finally, suppose you find some expression/algorithm/function f({xi}) that yields an estimate ϑ of the parameter θ from those collected samples.

    A key question: Is the value ϑ generated by f({xi}) a "good" estimate of θ? In other words, is f({xi}) a "good" estimator of θ?

    There are a number of ways of answering this question. One approach is to look at the behavior as the sample size N grows larger. The estimate ϑ should converge to θ as N→∞. An estimator that exhibits this behavior is called an "asymptotically consistent estimator". An estimator that isn't asymptotically consistent is close to worthless. For example, if I have billions of samples, I should be able to expect that the estimate is pretty close to spot on.

    Oftentimes we don't have the luxury of generating billions of samples. We might well only have a small handful of samples if it takes a lot of time or money to generate one sample. That an estimator is asymptotically consistent is no guarantee that it yields a "good" estimate when used against a finitely sized set of measurements.

    Note that ϑ is itself a random variable. Draw a new set of samples from X and you will get a different value for ϑ. Because ϑ is a random variable, it too has a mean. The bias of the estimator f is defined as b[ϑ]=E[ϑ]-θ. The estimator is "unbiased" if b[ϑ]=0 (in other words, if E[ϑ]=θ).

Let's look at the uncorrected estimate of the variance, ##s'^2=\frac{1}{N}\sum_1^N (x_i-\bar x)^2##. As shown in AlephZero's link, ##E[s'^2]=\sigma^2-\sigma^2/N##, and thus the bias of this estimator is ##-\sigma^2/N##. This obviously is not zero. The uncorrected estimate is a biased estimate. On the other hand, the estimate that results from using ##N-1## as the denominator as opposed to ##N## is an unbiased estimate.
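That bias can be seen numerically. A quick Monte Carlo sketch (the normal population, the seed, and the specific sample size are assumptions made purely for illustration):

```python
import numpy as np

# Monte Carlo sketch; a normal population is assumed purely for illustration
rng = np.random.default_rng(0)
sigma2 = 4.0            # true population variance
N = 5                   # small sample, so the bias is clearly visible
trials = 200_000

# Each row is one independent sample of size N
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
xbar = samples.mean(axis=1, keepdims=True)
ss = ((samples - xbar) ** 2).sum(axis=1)

# Averaging each estimator over many trials approximates its expected value
mean_biased = (ss / N).mean()          # near sigma^2 - sigma^2/N = 3.2
mean_unbiased = (ss / (N - 1)).mean()  # near sigma^2 = 4.0
```

With these numbers the ##N##-denominator estimator averages about 3.2 rather than 4.0, matching the predicted bias of ##-\sigma^2/N = -0.8##, while the ##N-1## version centers on the true variance.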
  9. Jan 10, 2014 #8


    Science Advisor

    It is whether you are linearly dependent with everyone else that is important. The set ##\{x_1, \ldots, x_n\}## is linearly independent. The set ##\{x_1, \ldots, x_n, \overline x\}## is linearly dependent.

    Because you are now using ##x_i## and ##\overline x##.
    Last edited: Jan 10, 2014
  10. Jan 10, 2014 #9


    Staff Emeritus
    Science Advisor
    Gold Member

A nice heuristic explanation that may get the juices flowing on why you divide by ##N-1## instead of ##N##: given [itex] x_1,...,x_N [/itex], [itex] \overline{x}[/itex] is actually the number that minimizes
    [tex] \frac{1}{N} \sum_{i} (x_i - \overline{x})^2. [/tex]

    If I told you the actual mean of the random variable was [itex] \mu[/itex] (which in general will not be equal to the sample mean), you would probably guess that the variance is
    [tex] \frac{1}{N} \sum_{i} (x_i-\mu)^2[/tex]
    instead, right? This number will always be larger than the previous guess. So we know that we are underestimating the variance when we use
    [tex] \frac{1}{N} \sum_{i} (x_i - \overline{x})^2, [/tex]
the only question is how much we are underestimating it by (which then gets into the unbiased-estimator explanation that others have posted above).
    Last edited: Jan 10, 2014
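The minimization claim is easy to check numerically. A small sketch with made-up data and a hypothetical true mean ##\mu##:

```python
import numpy as np

# Made-up data and a hypothetical "true" mean mu, for illustration only
x = np.array([1.0, 2.0, 2.0, 3.0, 7.0])
N = len(x)
xbar = x.mean()

def mean_sq_dev(c):
    """Average squared deviation of the data from a candidate centre c."""
    return float(((x - c) ** 2).sum() / N)

# Scan candidate centres: the minimum sits at the sample mean
cs = np.linspace(-10.0, 10.0, 2001)
best = cs[np.argmin([mean_sq_dev(c) for c in cs])]
assert np.isclose(best, xbar)

# Any other centre, including a true mean mu != xbar, gives a larger value
mu = 2.5
assert mean_sq_dev(mu) > mean_sq_dev(xbar)
```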
  11. Jan 10, 2014 #10


    Homework Helper

    Think about this: I tell you I'm thinking of six numbers, and that the sum of all six is zero. Then I say: the first five are 10, 20, 30, 40, 50.
Could you give me the final number? Yes: because you know the sum is zero, that final number has to be -150. Here, even though I have six numbers in mind, it is enough to know 5 of them, since the restriction that the sum equals zero lets you determine the remaining one.

First generalization: for any set of n numbers, if the sum of those numbers is zero, only n - 1 of them need to be specified, since the remaining one is then determined.

    Now, for your question, the numbers are
[tex] x_1 - \overline x,\ x_2 - \overline x,\ x_3 - \overline x,\ \dots,\ x_n - \overline x [/tex]

These add to zero, so even though there are n of them, only n - 1 need to be specified: that quantity n - 1 is the number of degrees of freedom for the set.

    The problem with this approach: it is really, really, not-statistical, and it doesn't generalize to other settings (at least not easily).
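The six-number example above, in code (the numbers are the ones from the post):

```python
# The six-number example: the sum of all six is constrained to be zero
known = [10, 20, 30, 40, 50]      # the first five numbers

# The constraint determines the sixth number completely
last = -sum(known)
assert last == -150
assert sum(known) + last == 0

# In general: n numbers under one linear constraint leave only n-1 free
# choices -- that count of free choices is the "degrees of freedom"
```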