Undergrad Standard deviation question -- population std vs sample std

Summary
The discussion focuses on the differences between population and sample standard deviation, emphasizing the formulas used for each. The formula using n-1 provides an unbiased estimator of the population variance, while using n results in a biased estimator that is on average lower than the true population variance by a factor of (n-1)/n. Both formulas serve as estimators of the population variance, but n-1 is preferred for its unbiasedness. It is also noted that estimators are themselves random variables with their own distributions, which complicates the interpretation of statistical properties. Clear definitions and assumptions are crucial for accurate statistical analysis and interpretation.
olgerm
TL;DR
population std vs sample std
I know that the standard deviation of any data is defined as the square root of the mean squared difference from the mean value:
##\sigma(data)=\sqrt{\frac{\sum_{x \in data}(x-x_{mean\ of\ data})^2}{|data|}}=\sqrt{\frac{\sum_{x \in data}(x-\sum_{y \in data}(y)/|data|)^2}{|data|}}##

but sometimes the formula ##\sigma_2(data)=\sqrt{\frac{\sum_{x \in data}(x-x_{mean\ of\ data})^2}{|data|-1}}=\sqrt{\frac{\sum_{x \in data}(x-\sum_{y \in data}(y)/|data|)^2}{|data|-1}}## is used instead.
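For concreteness, here is how the two formulas map onto numpy's ddof parameter (a minimal sketch with made-up data):

```python
import numpy as np

data = np.array([0.0, 1.0, 2.0])

sigma = data.std(ddof=0)    # first formula: divide by |data|
sigma_2 = data.std(ddof=1)  # second formula: divide by |data| - 1

print(sigma, sigma_2)  # 0.8165...  1.0
```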

Does the second formula:
  • estimate the population standard deviation based on a sample?
  • estimate the standard deviation of the means of all samples (of the same size) that can be taken from the population?
  • or is ##\sigma_2/\sqrt{n}## the estimate of the standard deviation of the means of all samples (of the same size) that can be taken from the population?
  • assume that the population is much larger than the sample?
  • assume something more?
 
When you calculate the expectation of the estimated variance you find that n-1 is needed to make this estimate equal to the true variance.
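In outline, for an i.i.d. sample ##X_1, \dots, X_n## with population mean ##\mu## and variance ##\sigma^2##, where ##\bar{X}## is the sample mean, the standard computation is
$$E\left[\sum_{i=1}^n (X_i - \bar{X})^2\right] = E\left[\sum_{i=1}^n (X_i - \mu)^2\right] - n\,E\left[(\bar{X} - \mu)^2\right] = n\sigma^2 - n \cdot \frac{\sigma^2}{n} = (n-1)\sigma^2,$$
so dividing by ##n-1## (rather than ##n##) makes the expectation exactly ##\sigma^2##.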
 
mathman said:
When you calculate the expectation of the estimated variance you find that n-1 is needed to make this estimate equal to the true variance.
True variance of what?
Variance of means of all samples(with the same size) that can be taken from population?
Or Variance of whole population?
Or Variance of something else?
 
To be precise, using either ##n## or ##n - 1## gives an estimator of the true variance. And what is meant by that is the population variance. Because you are calculating from a sample, and samples differ, these sample variances are random variables. If you did 1000 experiments and calculated the sample variance from each one, you'd get a distribution of sample variance values.

As I said, both of these statistics are estimators of the population variance. But you can prove that using ##n-1## gives you an unbiased estimator. The average value of the sample variance with ##n-1## is the population variance. And unbiased estimators are what we want.

On the other hand, if you use the sample variance with ##n## then on average it's a little lower than the population value, by a factor of ##(n-1)/n##. It's a biased estimator.
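This is easy to check numerically. A minimal simulation sketch (assuming a standard normal population, so the true variance is 1; the printed numbers are approximate):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 100_000

# Many independent samples of size n from a standard normal population.
samples = rng.standard_normal((trials, n))

# Average sample variance with n in the denominator (ddof=0) vs n-1 (ddof=1).
mean_var_n = samples.var(axis=1, ddof=0).mean()
mean_var_nm1 = samples.var(axis=1, ddof=1).mean()

print(f"with n:   {mean_var_n:.4f}   expected (n-1)/n = {(n - 1) / n:.4f}")
print(f"with n-1: {mean_var_nm1:.4f}   expected 1.0000")
```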
 
RPinPA said:
But you can prove that using ##n-1## gives you an unbiased estimator. The average value of the sample variance with ##n-1## is the population variance.
population= {0, 1, 2}
sample= (0, 1) ;sigma(sample)= 0.707106781186547 D(sample)= 1/2
sample= (0, 2) ;sigma(sample)= 1.414213562373095 D(sample)= 2
sample= (1, 2) ;sigma(sample)= 0.707106781186547 D(sample)= 1/2
mean sample stdev with n-1= 0.942809041582063
true std(population)= 0.81649658092772
mean sample variance with n-1= 1
true variance(population)=2/3

In this example, the average value of the sample variances with ##n-1## is not the population variance.
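For reference, a minimal script that reproduces this enumeration (note that it draws the three size-2 subsets, i.e. it samples without replacement):

```python
import itertools
import statistics

population = [0, 1, 2]

variances = []
for sample in itertools.combinations(population, 2):  # size-2 subsets, no replacement
    variances.append(statistics.variance(sample))     # uses the n-1 denominator

print("mean sample variance with n-1:", sum(variances) / len(variances))  # 1.0
print("population variance:", statistics.pvariance(population))           # 0.666...
```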
 
olgerm said:
In this example, the average value of the sample variances with ##n-1## is not the population variance.

To discuss statistics clearly, you must negotiate its ambiguous vocabulary.

Let's take "variance", for example.

The "Population variance" (for a particular population) is a constant. Since the population variance is a constant, it doesn't depend on the values of particular samples from the population. So you are correct that the population variance need not be equal to the "sample variance", regardless of how one computes that sample variance from a few samples. In most sophisticated problems, the population is not given as a list of indviduals. Instead we speak of a "probability distribution" that defines the probabilities for obtaining each possible sample from the population. Instead of a "population variance" we talk about the variance of the given probability distribution.

The term "sample variance" might mean:

1) A constant. For example, "I measured the weights of 10 people and the sample variance was 15.3 lbs^2" This use of "sample variance" is only unambiguous if the speaker specifies the particular formula he used to compute the sample variance.

2) An estimator. An "estimator" is a function that uses the data in a sample to estimate something, usually some property of the population. For a sample of ##n## values, ##\frac{\sum_{i=1}^n (x_i - \mu)^2}{n}##, ##\frac{\sum_{i=1}^n (x_i - \mu)^2}{n-1}##, ##\frac{\sum_{i=1}^n (x_i - m)^2}{n}## and ##\frac{\sum_{i=1}^n (x_i - m)^2}{n-1}## are each estimators of the population variance. Here ##\mu## is the population mean and ##m## is the mean of the sample (i.e. ##\frac{\sum_{i=1}^n x_i}{n}##). Of course we can't use a formula that requires ##\mu## unless we know the mean value of the population; in a typical situation we don't know it, so we use a formula that employs ##m##, the mean of the sample we took. (All four are sketched in code after this list.)

3) A random variable. A formula that depends on the values of a random sample produces random values, so a particular "estimator" of the population variance is a random variable, with its own probability distribution. The distribution of an estimator is usually not the same as the distribution of the population being sampled; it has its own mean and variance, which might differ from the mean and variance of the population.
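A minimal sketch of the four estimators from item 2 (the sample and the value of ##\mu## are taken from the ##\{0, 1, 2\}## example earlier in the thread):

```python
import numpy as np

def var_known_mean(x, mu, ddof=0):
    """Estimators that use the known population mean mu."""
    x = np.asarray(x, dtype=float)
    return ((x - mu) ** 2).sum() / (len(x) - ddof)

def var_sample_mean(x, ddof=0):
    """Estimators that use the sample mean m."""
    x = np.asarray(x, dtype=float)
    return ((x - x.mean()) ** 2).sum() / (len(x) - ddof)

sample = [0, 1]  # one of the samples from population {0, 1, 2}, which has mu = 1
mu = 1.0

print(var_known_mean(sample, mu, ddof=0))   # sum((x - mu)^2) / n      -> 0.5
print(var_known_mean(sample, mu, ddof=1))   # sum((x - mu)^2) / (n-1)  -> 1.0
print(var_sample_mean(sample, ddof=0))      # sum((x - m)^2) / n       -> 0.25
print(var_sample_mean(sample, ddof=1))      # sum((x - m)^2) / (n-1)   -> 0.5
```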

Students introduced to statistics tend to view a statistics problem as involving a single random variable (or a single population) with a single distribution. They don't appreciate that the various formulae applied to sample data introduce additional random variables.

The question of which estimator is "true" or "best" is ambiguous until specific criteria for "best" are defined. Many people prefer to use ##\hat{\sigma}^2 = \frac{\sum_{i=1}^n (x_i - m)^2}{n-1}## to estimate the population variance, because the mean of the distribution of ##\hat{\sigma}^2## is the same as the population variance. If the mean value of the distribution of an estimator is equal to the value of the thing it attempts to estimate, that estimator is called "unbiased".

If you want to check (or disprove) that the mean of ##\hat{\sigma}^2## is the population variance, you can't do it by looking at only a few samples. You must look at all possible samples that can be taken, each weighted by the probability that it happens. The mean value of an estimator can be regarded as the population mean of the population of all possible values of the estimator, obtained by applying it to all possible samples from the population that contains the data.
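To illustrate with the ##\{0, 1, 2\}## population from the earlier post: if the two draws are independent (sampling with replacement), there are nine equally likely ordered samples, and averaging the ##n-1## sample variance over all of them gives exactly the population variance. A minimal sketch:

```python
import itertools
import statistics

population = [0, 1, 2]

# All nine equally likely ordered samples of size 2, drawn WITH replacement.
variances = [statistics.variance(s)  # n-1 denominator
             for s in itertools.product(population, repeat=2)]

print("mean of n-1 sample variance:", sum(variances) / len(variances))   # 0.666...
print("population variance:        ", statistics.pvariance(population))  # 0.666...
```

This is the sense in which the ##n-1## formula is unbiased: the usual proof assumes independent draws, which sampling without replacement (as in the earlier enumeration of the three subsets) violates.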

olgerm said:
Does the second formula estimate the population standard deviation based on a sample?

I think you meant "population variance", since your two formulae don't employ a square root. Yes, they are estimators of the population variance.

olgerm said:
estimate the standard deviation of the means of all samples (of the same size) that can be taken from the population?

Technically, any formula could be declared to be an estimator of any property of a population. Of course it might not be a useful estimator of that property! The two formulae you gave are not useful estimators of the variance of the sample mean.

olgerm said:
is ##\sigma_2/\sqrt{n}## the estimate of the standard deviation of the means of all samples (of the same size) that can be taken from the population?

It can be regarded as an estimator of the standard deviation of the sample mean of ##n## values, considering the sample mean to be a random variable.

olgerm said:
assume that the population is much larger than the sample? assume something more?

The only assumptions estimators make are that they define functions that can be computed. For example, dividing by ##n-1## assumes ##n-1## is not zero. Employing the population mean ##\mu## in the formula for an estimator assumes the population's distribution has a mean. (Some probability distributions, e.g. the Cauchy distribution, do not have a mean value.)

The situations where further assumptions may be needed are those where we want to prove an estimator has certain properties, such as being unbiased.
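As a numerical check of the ##\sigma_2/\sqrt{n}## point, a minimal simulation sketch (assuming a standard normal population, so the true standard deviation of the mean of ##n## values is ##1/\sqrt{n}##):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 10, 100_000

samples = rng.standard_normal((trials, n))

# Standard deviation of the sample means across many experiments...
sd_of_means = samples.mean(axis=1).std()

# ...compared with the average of the per-sample estimate sigma_2 / sqrt(n).
estimates = samples.std(axis=1, ddof=1) / np.sqrt(n)

print(f"sd of sample means:        {sd_of_means:.4f}  (true value {1 / np.sqrt(n):.4f})")
print(f"mean of sigma_2 / sqrt(n): {estimates.mean():.4f}")
```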
 