Standard deviation question -- population std vs sample std

  • #1
olgerm
Gold Member
TL;DR Summary
population std vs sample std
I know that the standard deviation of a data set is defined as the square root of the mean squared difference from the mean value:
##\sigma(data)=\sqrt{\frac{\sum_{x \in data}(x-x_{mean\ of\ data})^2}{|data|}}=\sqrt{\frac{\sum_{x \in data}\left(x-\sum_{y \in data}y/|data|\right)^2}{|data|}}##

but sometimes the formula ##\sigma_2(data)=\sqrt{\frac{\sum_{x \in data}(x-x_{mean\ of\ data})^2}{|data|-1}}=\sqrt{\frac{\sum_{x \in data}\left(x-\sum_{y \in data}y/|data|\right)^2}{|data|-1}}## is used instead.
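A small Python sketch (mine, not from the thread; the function names `variance_n` and `variance_n_minus_1` are made up) computing both quantities on a toy data set:

```python
import math

def variance_n(data):
    """Mean squared deviation, dividing by n (first formula, before the square root)."""
    m = sum(data) / len(data)
    return sum((x - m) ** 2 for x in data) / len(data)

def variance_n_minus_1(data):
    """Mean squared deviation, dividing by n - 1 (second formula, before the square root)."""
    m = sum(data) / len(data)
    return sum((x - m) ** 2 for x in data) / (len(data) - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean = 5, sum of squared deviations = 32
print(math.sqrt(variance_n(data)))          # 2.0        (divide by 8)
print(math.sqrt(variance_n_minus_1(data)))  # ~2.138     (divide by 7)
```

The two results always differ by the fixed factor ##\sqrt{n/(n-1)}##, so they converge as the sample grows.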

Does the second formula:
  • estimate the population standard deviation based on a sample?
  • estimate the standard deviation of the means of all samples (of the same size) that can be taken from the population?
  • (or is ##\sigma_2/\sqrt{n}## an estimate of the standard deviation of the means of all samples of the same size that can be taken from the population?)
  • assume that the population is much larger than the sample?
  • assume something more?
 
  • #2
When you calculate the expectation of the estimated variance, you find that dividing by ##n-1## is needed to make the estimate equal, on average, to the true variance.
 
  • #3
mathman said:
When you calculate the expectation of the estimated variance you find that n-1 is needed to make this estimate equal to the true variance.
The true variance of what?
The variance of the means of all samples (of the same size) that can be taken from the population?
Or the variance of the whole population?
Or the variance of something else?
 
  • #4
To be precise, using either ##n## or ##n - 1## gives an estimator of the true variance. And what is meant by that is the population variance. Because you are calculating from a sample, and samples differ, these sample variances are random variables. If you did 1000 experiments and calculated the sample variance from each one, you'd get a distribution of sample variance values.

As I said, both of these statistics are estimators of the population variance. But you can prove that using ##n-1## gives you an unbiased estimator. The average value of the sample variance with ##n-1## is the population variance. And unbiased estimators are what we want.

On the other hand, if you use the sample variance with ##n## then on average it's a little lower than the population value, by a factor of ##(n-1)/n##. It's a biased estimator.
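The bias factor ##(n-1)/n## can be checked by simulation. This is a sketch of mine (names made up), drawing many samples of size 5 from a normal population with variance 4 and averaging both statistics:

```python
import random

random.seed(0)
n, trials = 5, 100_000
# Population: Normal(0, 2), so the true population variance is 4.

sum_vn = sum_vn1 = 0.0
for _ in range(trials):
    sample = [random.gauss(0.0, 2.0) for _ in range(n)]
    m = sum(sample) / n
    ss = sum((x - m) ** 2 for x in sample)
    sum_vn  += ss / n          # divide by n
    sum_vn1 += ss / (n - 1)    # divide by n - 1

print(sum_vn / trials)    # ≈ (n-1)/n * 4 = 3.2  (biased low)
print(sum_vn1 / trials)   # ≈ 4.0                (unbiased)
```

With a small ##n## the bias is substantial (here 20%), which is why the distinction matters most for small samples.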
 
  • #5
RPinPA said:
But you can prove that using ##n-1## gives you an unbiased estimator. The average value of the sample variance with ##n-1## is the population variance.
population = {0, 1, 2}
sample = (0, 1):  σ₂(sample) ≈ 0.7071,  variance D(sample) = 1/2
sample = (0, 2):  σ₂(sample) ≈ 1.4142,  variance D(sample) = 2
sample = (1, 2):  σ₂(sample) ≈ 0.7071,  variance D(sample) = 1/2
mean sample stdev with n-1 ≈ 0.9428
true std(population) ≈ 0.8165
mean sample variance with n-1 = 1
true variance(population) = 2/3

In this example average value of the sample variances with ##n-1## is not the population variance.
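The enumeration above can be reproduced in a few lines (a sketch of mine). Note that `itertools.combinations` draws the size-2 samples without replacement, which turns out to matter for the unbiasedness claim:

```python
from itertools import combinations

population = [0, 1, 2]
pop_mean = sum(population) / len(population)
pop_var = sum((x - pop_mean) ** 2 for x in population) / len(population)

# All size-2 samples drawn WITHOUT replacement, variance with n - 1 = 1
sample_vars = []
for sample in combinations(population, 2):
    m = sum(sample) / 2
    sample_vars.append(sum((x - m) ** 2 for x in sample) / (2 - 1))

print(pop_var)                              # 2/3
print(sum(sample_vars) / len(sample_vars))  # 1.0 -- not equal to 2/3
```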
 
  • #6
olgerm said:
In this example average value of the sample variances with ##n-1## is not the population variance.

To discuss statistics clearly you must negotiate its ambiguous vocabulary.

Let's take "variance", for example.

The "population variance" (for a particular population) is a constant. Since it is a constant, it doesn't depend on the values of particular samples from the population. So you are correct that the population variance need not equal the "sample variance", regardless of how one computes that sample variance from a few samples. In most sophisticated problems, the population is not given as a list of individuals. Instead we speak of a "probability distribution" that defines the probabilities for obtaining each possible sample from the population, and instead of a "population variance" we talk about the variance of the given probability distribution.

The term "sample variance" might mean:

1) A constant. For example, "I measured the weights of 10 people and the sample variance was 15.3 lbs^2" This use of "sample variance" is only unambiguous if the speaker specifies the particular formula he used to compute the sample variance.

2) An estimator. An "estimator" is a function that uses the data in a sample to estimate something, usually some property of the population. For a sample of ##n## values, ##\left(\sum_{i=1}^n (x_i - \mu)^2\right)/n##, ##\left(\sum_{i=1}^n (x_i - \mu)^2\right)/(n-1)##, ##\left(\sum_{i=1}^n (x_i - m)^2\right)/n##, and ##\left(\sum_{i=1}^n (x_i - m)^2\right)/(n-1)## are each estimators of the population variance. Here ##\mu## is the population mean and ##m## is the mean of the sample (i.e. ##\left(\sum_{i=1}^n x_i\right)/n##). Of course we can't use a formula that requires ##\mu## unless we know the mean value of the population. In a typical situation we don't know it, so we use a formula that employs ##m##, the mean of the sample we took.

3) A random variable. A formula that depends on the values of a random sample produces random values, so a particular "estimator" of the population variance is a random variable. Since an estimator is a random variable, it has its own probability distribution, which is usually not the same as the distribution of the population being sampled. The estimator's distribution has its own mean and variance, which may differ from the mean and variance of the population.
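The four estimators listed in item 2) can be sketched as small Python functions (the names are mine, not standard):

```python
def est_mu_n(xs, mu):
    # Uses the known population mean mu, divides by n.
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def est_mu_n1(xs, mu):
    # Uses the known population mean mu, divides by n - 1.
    return sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)

def est_m_n(xs):
    # Uses the sample mean m, divides by n.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def est_m_n1(xs):
    # Uses the sample mean m, divides by n - 1 (the unbiased one when xs is i.i.d.).
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

print(est_mu_n([1, 2, 3], 2.0))  # 2/3
print(est_m_n([1, 2, 3]))        # 2/3
print(est_m_n1([1, 2, 3]))       # 1.0
```

Each is a legitimate function of the data; which one is preferable depends on what you know (##\mu## or only ##m##) and on what criteria you adopt, as discussed below.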

Students introduced to statistics tend to view a statistics problem as something involving a single random variable (or a single population) with a single distribution. They don't conceptualize that the various formulae applied to sample data are introducing additional random variables.

The question of which estimator is "true" or "best" is ambiguous until specific criteria for "best" are defined. Many people prefer to use ##\hat{\sigma}^2 = \left(\sum_{i=1}^n (x_i - m)^2\right)/(n-1)## to estimate the population variance, because the mean of the distribution of ##\hat{\sigma}^2## is the same as the population variance. If the mean value of the distribution of an estimator is equal to the value of the thing it attempts to estimate, that estimator is called "unbiased".

If you want to check (or disprove) that the mean of ##\hat{\sigma}^2## is the population variance, you can't do it by looking at only a few samples. You must look at all possible samples that can be taken, each weighted by the probability that it happens. The mean value of an estimator can be regarded as the population mean of the population of all possible values of the estimator, obtained by applying it to all possible samples from the population that contains the data.
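As a concrete instance of "all possible samples weighted by probability" (a sketch of mine): if the two draws are independent, i.e. taken with replacement, all nine ordered pairs from {0, 1, 2} are equally likely, and the ##n-1## estimator then averages to exactly the population variance. This also explains the discrepancy in post #5, where the samples were taken without replacement:

```python
from itertools import product

population = [0, 1, 2]
pop_mean = sum(population) / 3
pop_var = sum((x - pop_mean) ** 2 for x in population) / 3   # 2/3

# Every equally likely ordered sample of size 2, drawn WITH replacement.
vals = []
for sample in product(population, repeat=2):
    m = sum(sample) / 2
    vals.append(sum((x - m) ** 2 for x in sample) / (2 - 1))

print(sum(vals) / len(vals))   # 2/3 -- matches pop_var exactly
```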

olgerm said:
Does the 2. formula estimate population standard deviation based on sample?

I think you meant "population variance", since your two formulae don't employ a square root. Yes, they are estimators of the population variance.

olgerm said:
estimate of standard deviation of means of all samples(with the same size) that can be taken from population?

Technically, any formula could be declared to be an estimator of any property of a population. Of course it might not be a useful estimator of that property! The two formulae you gave are not useful estimators of the standard deviation of the sample mean.

olgerm said:
is ##\sigma_2/\sqrt{n}## estimation of standard deviation of means of all samples(with the same size) that can be taken from population?

It can be regarded as an estimator of the standard deviation of the sample mean of ##n## values, considering the sample mean to be a random variable.

olgerm said:
assume that population is much larger than sample? assume something more?

The only assumptions estimators make are that they define functions that can be computed. For example, dividing by ##n-1## assumes ##n-1## is not zero. Employing the population mean ##\mu## in the formula for an estimator assumes the population's distribution has a mean. (Some probability distributions do not have a mean value, e.g. the Cauchy distribution.)

The situations where further assumptions may be needed are those where we want to prove an estimator has certain properties, such as being unbiased.
 

Related to Standard deviation question -- population std vs sample std

1. What is the difference between population standard deviation and sample standard deviation?

The population standard deviation (σ) measures the variability of a population of data, while the sample standard deviation (s) measures the variability of a sample of data. The population standard deviation is calculated using the entire population of data, while the sample standard deviation is calculated using a subset of the population (i.e. a sample).

2. Why is it important to distinguish between population standard deviation and sample standard deviation?

It is important to distinguish between population standard deviation and sample standard deviation because they are used for different purposes. The population standard deviation is used to describe the entire population, while the sample standard deviation is used to estimate the population standard deviation based on a sample. Using the wrong standard deviation can lead to incorrect conclusions about the data.

3. How do you calculate population standard deviation?

To calculate population standard deviation, you first need to find the mean of the population. Then, for each data point, subtract the mean and square the result. Next, find the average of these squared differences. Finally, take the square root of this average to get the population standard deviation.

4. How do you calculate sample standard deviation?

To calculate sample standard deviation, you first need to find the mean of the sample. Then, for each data point, subtract the mean and square the result. Next, find the sum of these squared differences. Finally, divide the sum by the sample size minus one and take the square root to get the sample standard deviation.
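For what it's worth, Python's standard library implements exactly these two formulas: `statistics.pstdev` divides by n (population) and `statistics.stdev` divides by n - 1 (sample). A quick check on a toy data set:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, sum of squared deviations 32
print(statistics.pstdev(data))    # population std, sqrt(32/8) = 2.0
print(statistics.stdev(data))     # sample std, sqrt(32/7) ≈ 2.138
```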

5. Can sample standard deviation be used as an estimate for population standard deviation?

Yes, the sample standard deviation can be used as an estimate of the population standard deviation. Note, however, that the estimate becomes more accurate as the sample size increases, so larger samples give more reliable estimates.
