Difference between sample standard deviation and population standard deviation?


Homework Help Overview

The discussion revolves around the difference between sample standard deviation and population standard deviation, focusing on the formulas used for each. The original poster questions why the sample standard deviation divides by n-1 while the population standard deviation divides by n.

Discussion Character

  • Conceptual clarification, Mathematical reasoning

Approaches and Questions Raised

  • Participants explore the reasoning behind using n-1 in the sample standard deviation formula, with some discussing the implications of this choice for estimating population statistics. Others seek numerical examples to clarify the concept further.

Discussion Status

The conversation includes explanations about the unbiased nature of the sample variance and its role as an estimator for the population variance. Some participants express understanding, while others continue to seek clarification on specific points, indicating an ongoing exploration of the topic.

Contextual Notes

There is mention of the technical distinction between "sample variance" and "variance of the sample," as well as the convention of using n-1 for better approximation, which may contribute to confusion among participants.

NewtonianAlch

Homework Statement

Just as the title suggests, although this is more to do with the formula. I know that for a sample, it implies it's a subset of a population. Why in the formula do you divide by n-1, whereas for calculating standard deviation for a population you divide by the total amount of elements in it?
 
Short Answer:
The idea is to use the sample to estimate the population statistics.
Dividing by n-1 gives you a better estimator for standard deviation.
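In symbols, with $\mu$ the population mean, $\bar{x}$ the sample mean, $N$ the population size, and $n$ the sample size:

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2
\qquad \text{(population variance: divide by } N\text{)}

s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2
\qquad \text{(sample variance: divide by } n-1\text{)}
```

The standard deviations are the square roots of these.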

Longer Answer:
The reason that n-1 is used instead of n in the formula for the sample variance is as follows. The sample variance can be thought of as a random variable, i.e. a function which takes on different values for different samples from the same distribution. Its use is as an estimate of the true variance of the distribution. In statistics one typically does not know the true variance; one uses the sample variance to ESTIMATE the true variance.

Since the sample variance is a random variable, it has a mean, or average value. One would hope that this average value is close to the actual value that the sample variance is estimating, i.e. close to the true variance. In fact, if n-1 is used in the defining formula for the sample variance, then it is possible to prove that the average value of the sample variance EQUALS the true variance. If we replace the n-1 by an n, then the average value of the sample variance is ((n-1)/n) times as large as the true variance.

A random variable X which is used to estimate a parameter p of a distribution is called an unbiased estimator if the expected value of X equals p. Thus, using the n-1 gives an
unbiased estimator of the variance of a distribution.
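For the record, the computation behind that claim is short (assuming the $x_i$ are independent draws with mean $\mu$ and variance $\sigma^2$):

```latex
\sum_{i=1}^{n}(x_i-\bar{x})^2 = \sum_{i=1}^{n}x_i^2 - n\bar{x}^2,
\qquad
E[x_i^2] = \sigma^2+\mu^2,
\qquad
E[\bar{x}^2] = \frac{\sigma^2}{n}+\mu^2,

\text{so}\quad
E\!\left[\sum_{i=1}^{n}(x_i-\bar{x})^2\right]
= n(\sigma^2+\mu^2) - n\!\left(\frac{\sigma^2}{n}+\mu^2\right)
= (n-1)\,\sigma^2 .
```

Dividing that sum by n-1 therefore gives an estimator whose expected value is exactly $\sigma^2$; dividing by n gives expected value $\frac{n-1}{n}\sigma^2$.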
 
OK, thanks for the response. I understand what is being said there in general, but I don't quite understand why n - 1 is still used. It says that if n - 1 is used in the defining formula then it is possible to prove, etc. I do not understand that part. Can you give me a short numerical example?
 
The usual exercise is to get the student to work out the distribution of sample variances.

But I think the confusion arises over the terms used, viz. the "sample variance" is a technical term that does not quite mean the same thing as "the variance of the sample", whereas the "population variance" is the same thing as "the variance of the population".

The sample variance is an approximation to the population variance which is agreed upon by convention. The division by (n-1) gives a better approximation than the division by n (which would have given you the variance of the sample).

To see what they are doing: remember that the idea is to figure out what the population mean and variance is without actually polling the entire population. You could take a sample of 1000 out of a population of several million ... what can you say, in general, about the entire population, from such a small number?

You could find the mean and variance for the sample ... OK. But if you took another sample of 1000 tomorrow you will very likely get a different mean and variance from them.

If you take a lot of samples, and they are all random, then the central limit theorem gives you a distribution of sample means which is approximately normal (the sample variances likewise have their own sampling distribution).

If the population were normally distributed, then the mean of the sample means will get closer to the population mean as the number of samples increases, but the mean of the variances of the samples (dividing by n) will be smaller than the population variance.

You should be able to confirm that by working them out for just three or four random normal variables. You should know how to add random distributions by now.

What the passage quoted is saying is that if you define the sample variance to divide by (n-1) it is more convenient for estimating the population variance which is what we are after.

We don't have to do it that way, it's a convention.
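Since a numerical example was asked for, here is a minimal simulation sketch (not from the thread; the population, sample size, and trial count are made up for illustration). It draws many small samples from a synthetic population and compares the average of the divide-by-n variances with the average of the divide-by-(n-1) variances:

```python
import random

random.seed(42)

def variance(xs, denominator):
    """Sum of squared deviations from the mean of xs, divided by `denominator`."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / denominator

# Hypothetical population: 100,000 draws from a standard normal distribution.
population = [random.gauss(0.0, 1.0) for _ in range(100_000)]
sigma2 = variance(population, len(population))  # true (population) variance

n, trials = 5, 50_000
avg_div_n, avg_div_n1 = 0.0, 0.0
for _ in range(trials):
    sample = random.sample(population, n)
    avg_div_n  += variance(sample, n)        # "variance of the sample"
    avg_div_n1 += variance(sample, n - 1)    # "sample variance" (n - 1 convention)
avg_div_n  /= trials
avg_div_n1 /= trials

print(f"population variance:      {sigma2:.3f}")
print(f"average, dividing by n:   {avg_div_n:.3f}  (about (n-1)/n of the true value)")
print(f"average, dividing by n-1: {avg_div_n1:.3f}  (close to the true value)")
```

With n = 5, the divide-by-n average comes out near 4/5 of the population variance, while the divide-by-(n-1) average lands on the population variance itself, which is the unbiasedness claim made above.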
 
Thanks for that Simon, I have a clearer understanding now.
 
