Undergrad Standard deviation question -- population std vs sample std

Summary
The discussion focuses on the differences between population and sample standard deviation, emphasizing the formulas used for each. The formula using n-1 provides an unbiased estimator of the population variance, while using n results in a biased estimator that is on average lower than the true population variance by a factor of (n-1)/n. Both formulas serve as estimators of the population variance, but n-1 is preferred for its unbiasedness. It is also noted that estimators are themselves random variables with their own distributions, which complicates the interpretation of statistical properties. Clear definitions and assumptions are crucial for accurate statistical analysis and interpretation.
olgerm
TL;DR
population std vs sample std
I know that the standard deviation of any data is defined as the square root of the mean squared difference from the mean value:
##\sigma(data)=\sqrt{\frac{\sum_{x \in data}(x-x_{mean\ of\ data})^2}{|data|}}=\sqrt{\frac{\sum_{x \in data}(x-\sum_{y \in data}(y)/|data|)^2}{|data|}}##

but sometimes the formula ##\sigma_2(data)=\sqrt{\frac{\sum_{x \in data}(x-x_{mean\ of\ data})^2}{|data|-1}}=\sqrt{\frac{\sum_{x \in data}(x-\sum_{y \in data}(y)/|data|)^2}{|data|-1}}## is used instead.
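For concreteness, here is how the two formulas map onto numpy's ddof parameter (a minimal sketch with made-up data):

```python
import numpy as np

data = np.array([0.0, 1.0, 2.0])

sigma = data.std(ddof=0)    # first formula: divide by |data|
sigma_2 = data.std(ddof=1)  # second formula: divide by |data| - 1

print(sigma, sigma_2)  # 0.8165...  1.0
```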

Does the second formula:
  • estimate the population standard deviation based on a sample?
  • estimate the standard deviation of the means of all samples (of the same size) that can be taken from the population?
  • or is ##\sigma_2/\sqrt{n}## the estimate of the standard deviation of the means of all samples (of the same size) that can be taken from the population?
  • assume that the population is much larger than the sample?
  • assume something more?
 
When you calculate the expectation of the estimated variance you find that n-1 is needed to make this estimate equal to the true variance.
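In outline, for an i.i.d. sample ##X_1, \dots, X_n## with population mean ##\mu## and variance ##\sigma^2##, where ##\bar{X}## is the sample mean, the standard computation is
$$E\left[\sum_{i=1}^n (X_i - \bar{X})^2\right] = E\left[\sum_{i=1}^n (X_i - \mu)^2\right] - n\,E\left[(\bar{X} - \mu)^2\right] = n\sigma^2 - n \cdot \frac{\sigma^2}{n} = (n-1)\sigma^2,$$
so dividing by ##n-1## (rather than ##n##) makes the expectation exactly ##\sigma^2##.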
 
mathman said:
When you calculate the expectation of the estimated variance you find that n-1 is needed to make this estimate equal to the true variance.
True variance of what?
Variance of means of all samples(with the same size) that can be taken from population?
Or Variance of whole population?
Or Variance of something else?
 
To be precise, using either ##n## or ##n - 1## gives an estimator of the true variance. And what is meant by that is the population variance. Because you are calculating from a sample, and samples differ, these sample variances are random variables. If you did 1000 experiments and calculated the sample variance from each one, you'd get a distribution of sample variance values.

As I said, both of these statistics are estimators of the population variance. But you can prove that using ##n-1## gives you an unbiased estimator. The average value of the sample variance with ##n-1## is the population variance. And unbiased estimators are what we want.

On the other hand, if you use the sample variance with ##n## then on average it's a little lower than the population value, by a factor of ##(n-1)/n##. It's a biased estimator.
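This is easy to check numerically. A minimal simulation sketch (assuming a standard normal population, so the true variance is 1; the printed numbers are approximate):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 100_000

# Many independent samples of size n from a standard normal population.
samples = rng.standard_normal((trials, n))

# Average sample variance with n in the denominator (ddof=0) vs n-1 (ddof=1).
mean_var_n = samples.var(axis=1, ddof=0).mean()
mean_var_nm1 = samples.var(axis=1, ddof=1).mean()

print(f"with n:   {mean_var_n:.4f}   expected (n-1)/n = {(n - 1) / n:.4f}")
print(f"with n-1: {mean_var_nm1:.4f}   expected 1.0000")
```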
 
RPinPA said:
But you can prove that using ##n-1## gives you an unbiased estimator. The average value of the sample variance with ##n-1## is the population variance.
population= {0, 1, 2}
sample= (0, 1) ;sigma(sample)= 0.707106781186547 D(sample)= 1/2
sample= (0, 2) ;sigma(sample)= 1.414213562373095 D(sample)= 2
sample= (1, 2) ;sigma(sample)= 0.707106781186547 D(sample)= 1/2
mean sample stdev with n-1= 0.942809041582063
true std(population)= 0.81649658092772
mean sample variance with n-1= 1
true variance(population)=2/3

In this example, the average value of the sample variances with ##n-1## is not the population variance.
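For reference, a minimal script that reproduces this enumeration (note that it draws the three size-2 subsets, i.e. it samples without replacement):

```python
import itertools
import statistics

population = [0, 1, 2]

variances = []
for sample in itertools.combinations(population, 2):  # size-2 subsets, no replacement
    variances.append(statistics.variance(sample))     # uses the n-1 denominator

print("mean sample variance with n-1:", sum(variances) / len(variances))  # 1.0
print("population variance:", statistics.pvariance(population))           # 0.666...
```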
 
olgerm said:
In this example, the average value of the sample variances with ##n-1## is not the population variance.

To discuss statistics clearly, you must negotiate its ambiguous vocabulary.

Let's take "variance", for example.

The "Population variance" (for a particular population) is a constant. Since the population variance is a constant, it doesn't depend on the values of particular samples from the population. So you are correct that the population variance need not be equal to the "sample variance", regardless of how one computes that sample variance from a few samples. In most sophisticated problems, the population is not given as a list of indviduals. Instead we speak of a "probability distribution" that defines the probabilities for obtaining each possible sample from the population. Instead of a "population variance" we talk about the variance of the given probability distribution.

The term "sample variance" might mean:

1) A constant. For example, "I measured the weights of 10 people and the sample variance was 15.3 lbs^2" This use of "sample variance" is only unambiguous if the speaker specifies the particular formula he used to compute the sample variance.

2) An estimator. An "estimator" is a function that uses the data in a sample to estimate something, usually some property of the population. For a sample of ##n## values, ##\frac{\sum_{i=1}^n (x_i - \mu)^2}{n}##, ##\frac{\sum_{i=1}^n (x_i - \mu)^2}{n-1}##, ##\frac{\sum_{i=1}^n (x_i - m)^2}{n}## and ##\frac{\sum_{i=1}^n (x_i - m)^2}{n-1}## are each estimators of the population variance. Here ##\mu## is the population mean and ##m## is the mean of the sample (i.e. ##\frac{\sum_{i=1}^n x_i}{n}##). Of course we can't use a formula that requires ##\mu## unless we know the mean value of the population; in a typical situation we don't know it, so we use a formula that employs ##m##, the mean of the sample we took. (All four are sketched in code after this list.)

3) A random variable. A formula that depends on the values of a random sample produces random values, so a particular "estimator" of the population variance is a random variable, with its own probability distribution. The distribution of an estimator is usually not the same as the distribution of the population being sampled; it has its own mean and variance, which might differ from the mean and variance of the population.
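A minimal sketch of the four estimators from item 2 (the sample and the value of ##\mu## are taken from the ##\{0, 1, 2\}## example earlier in the thread):

```python
import numpy as np

def var_known_mean(x, mu, ddof=0):
    """Estimators that use the known population mean mu."""
    x = np.asarray(x, dtype=float)
    return ((x - mu) ** 2).sum() / (len(x) - ddof)

def var_sample_mean(x, ddof=0):
    """Estimators that use the sample mean m."""
    x = np.asarray(x, dtype=float)
    return ((x - x.mean()) ** 2).sum() / (len(x) - ddof)

sample = [0, 1]  # one of the samples from population {0, 1, 2}, which has mu = 1
mu = 1.0

print(var_known_mean(sample, mu, ddof=0))   # sum((x - mu)^2) / n      -> 0.5
print(var_known_mean(sample, mu, ddof=1))   # sum((x - mu)^2) / (n-1)  -> 1.0
print(var_sample_mean(sample, ddof=0))      # sum((x - m)^2) / n       -> 0.25
print(var_sample_mean(sample, ddof=1))      # sum((x - m)^2) / (n-1)   -> 0.5
```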

Students introduced to statistics tend to view a statistics problem as involving a single random variable (or a single population) with a single distribution. They don't appreciate that the various formulae applied to sample data introduce additional random variables.

The question of which estimator is "true" or "best" is ambiguous until specific criteria for "best" are defined. Many people prefer to use ##\hat{\sigma}^2 = \frac{\sum_{i=1}^n (x_i - m)^2}{n-1}## to estimate the population variance, because the mean of the distribution of ##\hat{\sigma}^2## is the same as the population variance. If the mean value of the distribution of an estimator is equal to the value of the thing it attempts to estimate, that estimator is called "unbiased".

If you want to check (or disprove) that the mean of ##\hat{\sigma}^2## is the population variance, you can't do it by looking at only a few samples. You must look at all possible samples that can be taken, each weighted by the probability that it happens. The mean value of an estimator can be regarded as the population mean of the population of all possible values of the estimator, obtained by applying it to all possible samples from the population that contains the data.
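To illustrate with the ##\{0, 1, 2\}## population from the earlier post: if the two draws are independent (sampling with replacement), there are nine equally likely ordered samples, and averaging the ##n-1## sample variance over all of them gives exactly the population variance. A minimal sketch:

```python
import itertools
import statistics

population = [0, 1, 2]

# All nine equally likely ordered samples of size 2, drawn WITH replacement.
variances = [statistics.variance(s)  # n-1 denominator
             for s in itertools.product(population, repeat=2)]

print("mean of n-1 sample variance:", sum(variances) / len(variances))   # 0.666...
print("population variance:        ", statistics.pvariance(population))  # 0.666...
```

This is the sense in which the ##n-1## formula is unbiased: the usual proof assumes independent draws, which sampling without replacement (as in the earlier enumeration of the three subsets) violates.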

olgerm said:
Does the second formula estimate the population standard deviation based on a sample?

I think you meant "population variance", since your two formulae don't employ a square root. Yes, they are estimators of the population variance.

olgerm said:
estimate the standard deviation of the means of all samples (of the same size) that can be taken from the population?

Technically, any formula could be declared to be an estimator of any property of a population. Of course it might not be a useful estimator of that property! The two formulae you gave are not useful estimators of the variance of the sample mean.

olgerm said:
is ##\sigma_2/\sqrt{n}## the estimate of the standard deviation of the means of all samples (of the same size) that can be taken from the population?

It can be regarded as an estimator of the standard deviation of the sample mean of ##n## values, considering the sample mean to be a random variable.

olgerm said:
assume that the population is much larger than the sample? assume something more?

The only assumptions estimators make are that they define functions that can be computed. For example, dividing by ##n-1## assumes ##n-1## is not zero. Employing the population mean ##\mu## in the formula for an estimator assumes the population's distribution has a mean. (Some probability distributions, e.g. the Cauchy distribution, do not have a mean value.)

The situations where further assumptions may be needed are those where we want to prove an estimator has certain properties, such as being unbiased.
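As a numerical check of the ##\sigma_2/\sqrt{n}## point, a minimal simulation sketch (assuming a standard normal population, so the true standard deviation of the mean of ##n## values is ##1/\sqrt{n}##):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 10, 100_000

samples = rng.standard_normal((trials, n))

# Standard deviation of the sample means across many experiments...
sd_of_means = samples.mean(axis=1).std()

# ...compared with the average of the per-sample estimate sigma_2 / sqrt(n).
estimates = samples.std(axis=1, ddof=1) / np.sqrt(n)

print(f"sd of sample means:        {sd_of_means:.4f}  (true value {1 / np.sqrt(n):.4f})")
print(f"mean of sigma_2 / sqrt(n): {estimates.mean():.4f}")
```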
 