Degree of freedom and formula for standard variation

In summary: the thread discusses what "degrees of freedom" means for the variance of a sample of n data points, and why the unbiased estimate of the population variance uses n-1 rather than n in the denominator. A "good" estimator is asymptotically consistent and unbiased; a "bad" estimator is asymptotically inconsistent and/or biased. The standard 1/N formula for the sample variance yields a biased estimate of the population variance, with bias ##-\sigma^2/N##; the expression using N-1 instead of N yields an unbiased estimate.
  • #1
Nikitin
For a set with n points of data, why is the "degree of freedom" of the standard variance n-1? Hell, what does "degree of freedom" actually mean?

Heck, my book "proves" this by saying that since ##\sum_1^n (x_i - \bar{x}) = 0## (obviously), then ##\sum_1^n (x_i - \bar{x})^2## must have n-1 independent pieces of information? Is this connection supposed to be obvious?

My gut feeling agrees that the degree of freedom is n-1, but my brain does not understand. Can somebody explain it formally?

PS: My class statistics book is "Statistics for scientists and engineers, 9th ed.". Is it crap (so far I don't like it)? Can you guys recommend something better?
 
  • #2
I never heard of the term "standard variance".

The variance of a sample of size N is ##\frac{1}{N}\sum_1^N (x_i-\bar x)^2##.

However, that is not an unbiased estimate of the variance of the population. An unbiased estimate is ##\frac{1}{N-1}\sum_1^N (x_i-\bar x)^2##.

Some writers attempt to reduce the confusion by defining the "variance of a sample" as ##\frac{1}{N-1}\sum_1^N (x_i-\bar x)^2##.

IMO defining one thing incorrectly, to make it equal to something different, increases the confusion instead of reducing it, but it's very common.

How to do it right: https://maxwell.ict.griffith.edu.au/sso/biased_variance.pdf
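
A minimal numerical sketch of the two formulas, assuming Python with NumPy (whose np.var exposes both denominators through its ddof argument; the seed and sample values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)                 # illustrative seed
x = rng.normal(loc=5.0, scale=2.0, size=10)    # small sample; true variance is 4

xbar = x.mean()
biased = np.sum((x - xbar) ** 2) / len(x)           # divide by N
unbiased = np.sum((x - xbar) ** 2) / (len(x) - 1)   # divide by N-1

# NumPy's "delta degrees of freedom" argument selects the denominator:
# ddof=0 divides by N, ddof=1 divides by N-1.
assert np.isclose(biased, np.var(x, ddof=0))
assert np.isclose(unbiased, np.var(x, ddof=1))
print(biased, unbiased)   # the N-1 version is always the larger of the two
```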
 
  • #3
Nikitin said:
For a set with n points of data, why is the "degree of freedom" of the standard variance n-1? Hell, what does "degree of freedom" actually mean?

Heck, my book "proves" this by saying that since ##\sum_1^n (x_i - \bar{x}) = 0## (obviously), then ##\sum_1^n (x_i - \bar{x})^2## must have n-1 independent pieces of information? Is this connection supposed to be obvious?
The set [itex]\{x_i-\bar x\}[/itex] certainly doesn't have n independent pieces of information. How can it? Those n pieces of information are not independent because they *must* sum to zero. That set has at most n-1 independent pieces of information. If the samples are truly independent then it does have n-1 pieces of information.
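
A minimal sketch of that constraint, assuming Python with NumPy (seed and sample size are arbitrary): once ##\bar x## is fixed, the last deviation is completely determined by the other n-1.

```python
import numpy as np

rng = np.random.default_rng(1)   # illustrative seed
x = rng.normal(size=8)
d = x - x.mean()                 # the deviations x_i - xbar

# The deviations must sum to zero...
assert np.isclose(d.sum(), 0.0)

# ...so the last deviation carries no new information: it is just
# minus the sum of the first n-1. Only n-1 of them are free to vary.
assert np.isclose(d[-1], -d[:-1].sum())
```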

The degrees of freedom argument is nonetheless a bit ad hoc. A much better reason to use n-1 as the denominator as opposed to n is that the former yields an unbiased estimate of the variance. There's a rather sophisticated set of statistical machinery for what goes into forming unbiased estimates. This is however probably beyond the scope of an introductory statistics class.
 
  • #4
Thanks for all the replies!

Aleph, I'm sorry I meant "sample variance". Why is the estimate unbalanced though? What does unbalanced mean?

DH: Why doesn't the set have "n independent pieces of information" just because the sum of x_i - x_mean = 0?

And what does "unbiased" mean?
 
  • #5
Nikitin said:
Why doesn't the set have "n independent pieces of information" just because the sum of x_i - x_mean = 0?

You have ##x_1## to ##x_n##, but also ##\overline x##. These are linearly dependent: ##x_n = n \overline x - x_1 - \ldots - x_{n-1}##.
Note that when calculating the variance, you are using ##\overline x## in addition to the ##x_i## in your equation.
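
A one-line check of that identity, assuming Python with NumPy (the data values are arbitrary):

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])   # arbitrary sample
n, xbar = len(x), x.mean()

# x_n is determined by the mean and the other n-1 points:
# x_n = n*xbar - x_1 - ... - x_{n-1}
assert np.isclose(x[-1], n * xbar - x[:-1].sum())
```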
 
  • #6
pwsnafu said:
You have ##x_1## to ##x_n##, but also ##\overline x##. These are linearly dependent: ##x_n = n \overline x - x_1 - \ldots - x_{n-1}##.
But this would be like saying ##x_n## is linearly dependent with ##x_n##, since ##x_n = n\bar{x} - x_1 - x_2 - \dots - x_{n-1} = (x_1 + x_2 + \dots + x_n) - x_1 - x_2 - \dots - x_{n-1} = x_n##. Considering the original data set had n pieces of independent information, why does 1 piece suddenly disappear?

Note that when calculating the variance, you are using ##\overline x## in addition to the ##x_i## in your equation.

When I first read the formula for the sample variance I had a gut feeling about why it was correct to divide by ##n-1##. I mean, if you were to calculate the sample variance of a set with an odd number of points by looking at the deviations from the median instead of the mean, [tex]\frac{\sum_1^n (x_i- \tilde{x})^2}{n-1},[/tex] then you would have to divide by ##n-1##, because the middle ##x_i##-value equals the median and thus gets zeroed out and doesn't contribute anything.

My intuition said that perhaps something similar happens when calculating the sample variance. Am I on the right track?

EDIT: After reading AlephZero's link, I can see that I don't have enough knowledge of statistics to understand the answers to my question. Well, maybe I'll try again in 6 months. Thanks for the help!
 
  • #7
Nikitin said:
Thanks for all the replies!

Aleph, I'm sorry I meant "sample variance". Why is the estimate unbalanced though? What does unbalanced mean?

DH: Why doesn't the set have "n independent pieces of information" just because the sum of x_i - x_mean = 0?

And what does "unbiased" mean?
AlephZero did not use the word "unbalanced". Both he and I used the word "unbiased."

Suppose you have a set of samples ##\{x_i\}## of size ##N## of a random variable ##X##, and you want to estimate some parameter ##\theta## of the underlying probability distribution. For example, ##\theta## might be the mean, or the variance, or something more esoteric such as the probability that something bad will happen because of ##X##. Finally, suppose you find some expression/algorithm/function ##f(\{x_i\})## that yields an estimate ##\vartheta## of the parameter ##\theta## from those collected samples.

A key question: Is the value ##\vartheta## generated by ##f(\{x_i\})## a "good" estimate of ##\theta##? In other words, is ##f(\{x_i\})## a "good" estimator of ##\theta##?

There are a number of ways of answering this question. One approach is to look at the behavior as the sample size ##N## grows larger. The estimate ##\vartheta## should converge to ##\theta## as ##N \to \infty##. An estimator that exhibits this behavior is called an "asymptotically consistent estimator". An estimator that isn't asymptotically consistent is close to worthless: if I have billions of samples, I should expect the estimate to be pretty close to spot on.

Oftentimes we don't have the luxury of generating billions of samples. We might well have only a small handful of samples if it takes a lot of time or money to generate one. That an estimator is asymptotically consistent is no guarantee that it yields a "good" estimate when used against a finitely sized set of measurements.
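
A sketch of asymptotic consistency, assuming Python with NumPy (seed and sample sizes are illustrative): even the biased ##1/N## estimator homes in on the true variance as ##N## grows, because its bias shrinks like ##1/N##.

```python
import numpy as np

rng = np.random.default_rng(2)   # illustrative seed
sigma2 = 4.0                     # true population variance

# Even the biased 1/N estimator is asymptotically consistent:
# its estimates approach sigma^2 as the sample size N grows.
for N in (10, 1_000, 100_000):
    x = rng.normal(scale=np.sqrt(sigma2), size=N)
    print(N, np.var(x, ddof=0))  # tends toward 4.0 as N increases
```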

Note that ##\vartheta## is itself a random variable: draw a new set of samples from ##X## and you will get a different value for ##\vartheta##. Because ##\vartheta## is a random variable, it too has a mean. The bias of the estimator ##f## is defined as ##b[\vartheta] = E[\vartheta] - \theta##. The estimator is "unbiased" if ##b[\vartheta] = 0## (in other words, if ##E[\vartheta] = \theta##).

Let's look at the uncorrected estimate of the variance, ##s'^2 = \frac{1}{N}\sum_1^N (x_i-\bar x)^2##. As shown in AlephZero's link, ##E[s'^2] = \sigma^2 - \sigma^2/N##, and thus the bias of this estimator is ##-\sigma^2/N##. This obviously is not zero: the uncorrected estimate is a biased estimate. On the other hand, the estimate that results from using ##N-1## as the denominator instead of ##N## is unbiased.
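
A Monte Carlo sketch of that bias, assuming Python with NumPy (seed, sample size, and trial count are illustrative): averaging the ##1/N## estimate over many independent samples should land near ##\sigma^2 - \sigma^2/N##, while the ##1/(N-1)## estimate averages to ##\sigma^2##.

```python
import numpy as np

rng = np.random.default_rng(3)        # illustrative seed
sigma2, N, trials = 4.0, 5, 200_000   # true variance 4, tiny samples of size 5

samples = rng.normal(scale=2.0, size=(trials, N))
biased = samples.var(axis=1, ddof=0)    # one 1/N estimate per trial
unbiased = samples.var(axis=1, ddof=1)  # one 1/(N-1) estimate per trial

print(biased.mean())    # approximately sigma^2 - sigma^2/N = 4 - 0.8 = 3.2
print(unbiased.mean())  # approximately sigma^2 = 4.0
```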
 
  • #8
Nikitin said:
But this would be like saying ##x_n## is linearly dependent with ##x_n##, since ##x_n = n\bar{x} - x_1 - x_2 - \dots - x_{n-1} = (x_1 + x_2 + \dots + x_n) - x_1 - x_2 - \dots - x_{n-1} = x_n##.

What matters is whether the set as a whole is linearly dependent. The set ##\{x_1, \ldots, x_n\}## is linearly independent. The set ##\{x_1, \ldots, x_n, \overline x\}## is linearly dependent.

Considering the original data set had n pieces of independent information, why does 1 piece suddenly disappear?

Because you are now using ##\overline x##, which is computed from the ##x_i##, in addition to the ##x_i## themselves. One of those ##n+1## quantities is redundant, so a piece of information is spent on the mean.
 
  • #9
A nice heuristic explanation that may get the juices flowing for why you divide by N-1 instead of N is that, given [itex] x_1,\ldots,x_N [/itex], [itex] \overline{x}[/itex] is actually the number that minimizes
[tex] \frac{1}{N} \sum_{i} (x_i - \overline{x})^2. [/tex]

If I told you the actual mean of the random variable was [itex] \mu[/itex] (which in general will not be equal to the sample mean), you would probably guess that the variance is
[tex] \frac{1}{N} \sum_{i} (x_i-\mu)^2[/tex]
instead, right? This number will always be at least as large as the previous guess. So we know that we are underestimating the variance when we use
[tex] \frac{1}{N} \sum_{i} (x_i - \overline{x})^2; [/tex]
the only question is how much we are underestimating it by (which then gets into the unbiasedness explanation that others have posted above).
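
That minimizing property is easy to verify numerically; a sketch assuming Python with NumPy, with the true mean ##\mu## treated as known by construction (seed and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)   # illustrative seed
mu = 5.0                         # true mean, known here by construction
x = rng.normal(loc=mu, scale=2.0, size=10)

def mean_sq_dev(c):
    """Average squared deviation of the sample around the point c."""
    return np.mean((x - c) ** 2)

# The sample mean minimizes the average squared deviation, so plugging
# in the true mean mu (or any other point) can only give a larger value.
assert mean_sq_dev(x.mean()) <= mean_sq_dev(mu)
print(mean_sq_dev(x.mean()), mean_sq_dev(mu))
```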
 
  • #10
Think about this: I tell you I'm thinking of six numbers, and that the sum of all six is zero. Then I say: the first five are 10, 20, 30, 40, 50.
Could you give me the final number? Yes: because you know the sum is zero, that final number has to be -150. Here, even though I have six numbers in mind, it is enough to know five of them, since the restriction that the sum equals zero lets you determine the remaining one.

First generalization: for any set of n numbers, if the sum of those numbers is zero, only n-1 of them need to be specified, since the remaining one is then determined.

Now, for your question, the numbers are
[tex]
x_1 - \overline x, x_2 - \overline x, x_3 - \overline x, \dots, x_n - \overline x
[/tex]

These add to zero, so even though there are n of them, only n-1 need to be specified: the quantity n-1 is the number of degrees of freedom of the set.

The problem with this approach: it is really, really not statistical, and it doesn't generalize to other settings (at least not easily).
 

Related to Degree of freedom and formula for standard variation

1. What is the concept of degree of freedom?

The degree of freedom is a statistical concept that refers to the number of independent pieces of information that are available to estimate a population parameter. In other words, it is the number of values that are free to vary in a calculation or statistical analysis.

2. How is degree of freedom calculated?

The formula for calculating the degrees of freedom depends on the type of statistical test or analysis being performed. In general, it is equal to the sample size minus the number of parameters being estimated. For example, in a t-test comparing two means, the degrees of freedom equal the combined sample size of the two groups minus two (e.g., two groups of 10 observations give 10 + 10 - 2 = 18 degrees of freedom).

3. What is the formula for standard deviation?

The standard deviation is the square root of the average squared deviation from the mean. For a population it is denoted σ (sigma) and computed as σ = √(Σ(x-μ)²/n); for a sample, it is denoted s and the sum is divided by n-1 instead of n.
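
A sketch of both forms, assuming Python with NumPy, whose np.std selects the denominator via its ddof argument (the sample values are arbitrary):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # arbitrary sample

# Population form: square root of the mean squared deviation (divide by n).
sigma = np.sqrt(np.sum((x - x.mean()) ** 2) / len(x))
assert np.isclose(sigma, np.std(x, ddof=0))   # equals 2.0 for this data

# Sample form: divide by n-1 instead of n.
s = np.sqrt(np.sum((x - x.mean()) ** 2) / (len(x) - 1))
assert np.isclose(s, np.std(x, ddof=1))
```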

4. How is the standard deviation used in data analysis?

The standard deviation measures the spread or variability of a dataset, that is, how much the data points deviate from the mean. It is commonly used in inferential statistics to assess the reliability and significance of results, as well as to make comparisons between groups.

5. What are the limitations of using standard deviation?

While the standard deviation is a useful measure of variability, it has some limitations. Its common interpretations (such as the 68-95-99.7 rule) assume the data are approximately normally distributed, and it can be strongly affected by outliers. Additionally, it does not capture the shape of the distribution or any underlying patterns in the data. Other measures, such as the interquartile range or the coefficient of variation, may be more appropriate in certain situations.
