Standard deviation question -- population std vs sample std


Discussion Overview

The discussion revolves around the differences between population standard deviation and sample standard deviation, focusing on the formulas used to calculate them and the implications of using different denominators (n vs n-1). Participants explore the conditions under which these formulas are applicable and the interpretations of variance in statistical terms.

Discussion Character

  • Technical explanation
  • Conceptual clarification
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • Some participants propose that the formula using n-1 provides an unbiased estimator of the population variance, while using n results in a biased estimator.
  • There is a question about whether the second formula estimates the population standard deviation based on a sample or the standard deviation of means of all samples of the same size.
  • Concerns are raised regarding the interpretation of "true variance" and whether it refers to the variance of the population, the variance of sample means, or something else.
  • One participant emphasizes that both formulas yield estimators of the population variance, but the average of the sample variance with n-1 is unbiased.
  • Another participant discusses the ambiguity in statistical vocabulary, particularly regarding the terms "population variance" and "sample variance," and how they can represent constants, estimators, or random variables.
  • There is a suggestion that the estimators must be evaluated based on their definitions and the assumptions they entail, such as the population being larger than the sample.

Areas of Agreement / Disagreement

Participants express differing views on the interpretations of variance and the appropriateness of the formulas used for estimation. There is no consensus on the implications of using n vs n-1, and the discussion remains unresolved regarding the best approach to defining and estimating variance.

Contextual Notes

Participants note that the definitions and assumptions underlying the estimators can significantly affect their interpretations and applications. The discussion highlights the need for clarity in statistical terminology and the conditions under which different estimators are valid.

olgerm
TL;DR
population std vs sample std
I know that the standard deviation of any data set is defined as the square root of the mean squared difference from the mean value:
##\sigma(data)=\frac{\sum_{x \in data}((x-x_{mean\ of\ data})^2)}{|data|}=\frac{\sum_{x \in data}((x-\sum_{y \in data}(y)/|data|)^2)}{|data|}##

but sometimes the formula ##\sigma_2(data)=\frac{\sum_{x \in data}((x-x_{mean\ of\ data})^2)}{|data|-1}=\frac{\sum_{x \in data}((x-\sum_{y \in data}(y)/|data|)^2)}{|data|-1}## is used.

Does the second formula:
  • estimate the population standard deviation based on a sample?
  • estimate the standard deviation of the means of all samples (of the same size) that can be taken from the population?
  • is ##\sigma_2/\sqrt{n}## an estimate of the standard deviation of the means of all samples (of the same size) that can be taken from the population?
  • assume that the population is much larger than the sample?
  • assume something more?
 
When you calculate the expectation of the estimated variance you find that n-1 is needed to make this estimate equal to the true variance.
 
mathman said:
When you calculate the expectation of the estimated variance you find that n-1 is needed to make this estimate equal to the true variance.
True variance of what?
The variance of the means of all samples (of the same size) that can be taken from the population?
Or the variance of the whole population?
Or the variance of something else?
 
To be precise, using either ##n## or ##n - 1## gives an estimator of the true variance. And what is meant by that is the population variance. Because you are calculating from a sample, and samples differ, these sample variances are random variables. If you did 1000 experiments and calculated the sample variance from each one, you'd get a distribution of sample variance values.

As I said, both of these statistics are estimators of the population variance. But you can prove that using ##n-1## gives you an unbiased estimator. The average value of the sample variance with ##n-1## is the population variance. And unbiased estimators are what we want.

On the other hand, if you use the sample variance with ##n## then on average it's a little lower than the population value, by a factor of ##(n-1)/n##. It's a biased estimator.
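
The expectation calculation behind these statements can be sketched as follows. The sketch assumes the ##x_i## are independent draws from a distribution with mean ##\mu## and variance ##\sigma^2##, and writes ##m## for the sample mean; the symbols and the independence assumption belong to this sketch and are not spelled out in the posts above.

```latex
\begin{aligned}
\sum_{i=1}^n (x_i - m)^2 &= \sum_{i=1}^n (x_i - \mu)^2 - n\,(m - \mu)^2 \\
E\left[\sum_{i=1}^n (x_i - m)^2\right] &= n\sigma^2 - n \cdot \frac{\sigma^2}{n} = (n-1)\,\sigma^2
\end{aligned}
```

Dividing the sum by ##n-1## therefore gives an expectation of ##\sigma^2##, while dividing by ##n## gives ##\frac{n-1}{n}\sigma^2##, which is the factor quoted above. The second line uses ##E[(m-\mu)^2] = \sigma^2/n##, which holds for independent draws.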
 
RPinPA said:
But you can prove that using ##n-1## gives you an unbiased estimator. The average value of the sample variance with ##n-1## is the population variance.
population= {0, 1, 2}
sample= (0, 1) ;sigma(sample)= 0.707106781186547 D(sample)= 1/2
sample= (0, 2) ;sigma(sample)= 1.414213562373095 D(sample)= 2
sample= (1, 2) ;sigma(sample)= 0.707106781186547 D(sample)= 1/2
mean sample stdev with n-1= 0.942809041582063
true std(population)= 0.81649658092772
mean sample variance with n-1= 1
true variance(population)=2/3

In this example the average value of the sample variances with ##n-1## is not the population variance.
 
olgerm said:
In this example the average value of the sample variances with ##n-1## is not the population variance.

To discuss statistics clearly you must negotiate its ambiguous vocabulary.

Let's take "variance", for example.

The "population variance" (for a particular population) is a constant. Since the population variance is a constant, it doesn't depend on the values of particular samples from the population. So you are correct that the population variance need not be equal to the "sample variance", regardless of how one computes that sample variance from a few samples. In most sophisticated problems, the population is not given as a list of individuals. Instead we speak of a "probability distribution" that defines the probabilities for obtaining each possible sample from the population. Instead of a "population variance" we talk about the variance of the given probability distribution.

The term "sample variance" might mean:

1) A constant. For example, "I measured the weights of 10 people and the sample variance was 15.3 lbs^2". This use of "sample variance" is only unambiguous if the speaker specifies the particular formula he used to compute the sample variance.

2) An estimator. An "estimator" is a function that uses the data in a sample to estimate something, usually to estimate some property of the population. For a sample of ##n## values, ##(\sum_{i=1}^n (x_i - \mu)^2)/n## and ##(\sum_{i=1}^n (x_i - \mu)^2)/(n-1)## and ##(\sum_{i=1}^n (x_i - m)^2)/n## and ##(\sum_{i=1}^n (x_i - m)^2)/(n-1)## are each estimators of the population variance. Here ##\mu## is the population mean and ##m## is the mean of the sample (i.e. ##(\sum_{i=1}^n x_i)/n##). Of course we can't use a formula that requires ##\mu## unless we know the mean value of the population. In a typical situation, we don't know it, so we use a formula that employs ##m##, the mean of the sample we took.

3) A random variable. A formula that depends on the values of a random sample produces random values. So a particular "estimator" for the population variance is a random variable. Random variables have distributions. Since an estimator is a random variable, it has its own probability distribution. The distribution of an estimator is usually not the same as the distribution of the population being sampled. That estimator's distribution has its own mean and variance, which might be different than the mean and variance of the population.
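
As a rough illustration of meaning 2), the four estimators listed above can be written as one small Python function. This is only a sketch: the function name `variance_estimators`, the dictionary keys, and the example data are hypothetical and chosen for illustration.

```python
from statistics import fmean


def variance_estimators(sample, mu=None):
    """Return the variance estimators listed above for one sample.

    `mu` is the population mean, which is usually unknown. When it is
    None, only the two estimators built on the sample mean m are returned.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two values to divide by n - 1")
    m = fmean(sample)  # sample mean
    ss_m = sum((x - m) ** 2 for x in sample)  # squared deviations about m
    result = {
        "about_m_div_n": ss_m / n,
        "about_m_div_n_minus_1": ss_m / (n - 1),
    }
    if mu is not None:
        ss_mu = sum((x - mu) ** 2 for x in sample)  # squared deviations about mu
        result["about_mu_div_n"] = ss_mu / n
        result["about_mu_div_n_minus_1"] = ss_mu / (n - 1)
    return result


# Hypothetical data: the sample (0, 1) taken from the population {0, 1, 2},
# whose population mean is 1.
estimates = variance_estimators([0, 1], mu=1)
```

Each entry of `estimates` is a constant for this one sample (meaning 1). Applying the function to a freshly drawn random sample each time would produce random values (meaning 3).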

Students introduced to statistics tend to view a statistics problem as something involving a single random variable (or a single population) with a single distribution. They don't conceptualize that the various formulae applied to sample data are introducing additional random variables.

The question of which estimator is "true" or "best" is ambiguous until specific criteria for "best" are defined. Many people prefer to use ##\hat{\sigma^2} = (\sum_{i=1}^n (x_i - m)^2)/(n-1)## to estimate the population variance, because the mean of the distribution of ##\hat{\sigma^2}## is the same as the population variance. If the mean value of the distribution of an estimator is equal to the value of the thing it attempts to estimate, then that estimator is called "unbiased".

If you want to check (or disprove) that the mean of ##\hat{\sigma^2}## is the population variance, you can't do this by looking at only a few samples. You must look at all possible samples that can be taken, each weighted by the probability that it occurs. The mean value of an estimator can be regarded as the population mean of the population of all possible values of the estimator. The possible values of the estimator come from applying it to all possible samples from the population that contains the data the estimator uses.
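
For a population as small as {0, 1, 2}, this check over all possible samples can be done exactly. The sketch below is one way to do it, under an assumption the thread does not state explicitly: "all possible samples" is taken once as every pair of distinct members (sampling without replacement) and once as every ordered pair with repeats allowed (independent draws, each pair equally likely). The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product


def mean(values):
    """Exact arithmetic mean of an iterable of Fractions."""
    values = list(values)
    return sum(values, Fraction(0)) / len(values)


def sample_variance(sample):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    m = mean(sample)
    return sum((x - m) ** 2 for x in sample) / (len(sample) - 1)


def population_variance(population):
    """Mean squared deviation from the population mean (divides by N)."""
    mu = mean(population)
    return mean((x - mu) ** 2 for x in population)


population = [Fraction(v) for v in (0, 1, 2)]
n = 2

# Every pair of distinct members: sampling without replacement.
without_replacement = list(combinations(population, n))
# Every ordered pair, repeats allowed: independent, equally likely draws.
with_replacement = list(product(population, repeat=n))

mean_s2_without = mean(sample_variance(s) for s in without_replacement)
mean_s2_with = mean(sample_variance(s) for s in with_replacement)
true_variance = population_variance(population)

print(mean_s2_without, mean_s2_with, true_variance)
```

Without replacement, the average of the ##n-1## sample variances is 1, matching the figure posted earlier in the thread. With independent draws it is 2/3, equal to the population variance. The usual proof of unbiasedness relies on the draws being independent, which sampling without replacement from a three-member population does not satisfy.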

olgerm said:
Does the second formula:
  • estimate the population standard deviation based on a sample?
I think you meant "population variance" since your two formulae don't employ a square root. Yes, they are estimators of the population variance.

olgerm said:
  • estimate the standard deviation of the means of all samples (of the same size) that can be taken from the population?
Technically, any formula could be declared to be an estimator of any property of a population. Of course it might not be a useful estimator of the property! The two formulae you gave are not useful estimators for the variance of the sample mean.

olgerm said:
  • is ##\sigma_2/\sqrt{n}## an estimate of the standard deviation of the means of all samples (of the same size) that can be taken from the population?
It can be regarded as an estimator of the variance of the sample mean of n values, considering the sample mean to be a random variable.

olgerm said:
  • assume that the population is much larger than the sample?
  • assume something more?
The only assumption estimators make is that they define functions that can be computed. For example, dividing by n-1 assumes n-1 is not zero. Employing the population mean ##\mu## in the formula for an estimator assumes the population's distribution has a mean. (Some probability distributions do not have a mean value, e.g. the Cauchy distribution.)

The situations where further assumptions may be needed are those where we want to prove an estimator has certain properties, such as being unbiased.
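
To make "the sample mean as a random variable" concrete, here is a small sketch for the population {0, 1, 2} with samples of size 2. It assumes independent draws with replacement, so every ordered pair is equally likely. The function `standard_error` is a hypothetical helper: it takes the square root of the ##n-1## variance and divides by ##\sqrt{n}##, which is the usual single-sample estimate of the standard deviation of the sample mean.

```python
from itertools import product
from math import sqrt
from statistics import fmean, pvariance, stdev

population = [0, 1, 2]
n = 2

# All ordered samples of size n drawn with replacement (each equally likely).
samples = list(product(population, repeat=n))
sample_means = [fmean(s) for s in samples]

# Variance of the sample mean, taken over every possible sample.
variance_of_mean = pvariance(sample_means)
# For independent draws this should equal the population variance divided by n.
theoretical = pvariance(population) / n


def standard_error(sample):
    """Estimate the standard deviation of the sample mean from one sample."""
    return stdev(sample) / sqrt(len(sample))  # stdev divides by n - 1


print(variance_of_mean, theoretical, standard_error((0, 2)))
```

Here the variance of the sample mean over all nine samples agrees with the population variance divided by ##n##. Note that `standard_error` works with a standard deviation, that is, a square root, whereas the two formulae in the original post do not include one.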
 