# Calculation of sample variance

## Main Question or Discussion Point

I am new to statistics. I came across the sample variance calculation in a book, which explains that we divide by n-1 instead of n because the variance computed from a sample is likely to be lower than the population variance, so dividing by n-1 makes the estimate larger.

However, when I studied the t-distribution, I saw that with small n the distribution has fat tails, and with larger n the tails become thinner. So it seems that with small n it has a larger variance. If I treat small n as the sampling case above, the two statements about the variance seem to contradict each other (the first says it would be smaller and the second says it would be larger). Could someone help me with this?

Thanks.

Stephen Tashi
The first thing to do is to get your terminology straightened out. When you talk about "the variance" you aren't making any distinction between the variance of a population and estimates of that variance computed from a sample. And you are making statements about different populations without saying what they are.

One of the reasons that statistics gets conceptually complicated is that the typical scenario involves at least two populations. The first population is usually something simple like the population of people's weights. This population will usually have a distribution (such as a lognormal distribution) that is defined by a set of parameters (such as the mean and variance).

In a problem where we are attempting to estimate these parameters, we usually do a computation based on the values of N independent samples from the first population. The result of this computation is a "statistic". Since the sample values are random, a "statistic" is a random variable. This contradicts the layman's idea that a "statistic" is a single numerical value. It is the population parameters, such as the mean weight of the population, that can be thought of as single numerical values. One may also think of the sample mean from one particular sample as a single numerical value. But a "statistic" is a random variable. The statistic has its own population of possible values. This population has a probability distribution that is usually defined by its own set of parameters (mean, variance, etc.). So we have a second population involved.
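A minimal sketch of the "statistic is a random variable" idea, in Python. The lognormal parameters, the sample size of 10, and the names `sample_mean` and `draw_sample` are illustrative assumptions, not anything from the thread. Each new sample from the first population yields a different realized value of the same statistic.

```python
import random

def sample_mean(sample):
    """The statistic: a rule that maps a sample to a number."""
    return sum(sample) / len(sample)

rng = random.Random(0)  # fixed seed so the run is repeatable

def draw_sample(n):
    """Draw n independent values from a hypothetical lognormal population."""
    return [rng.lognormvariate(4.3, 0.2) for _ in range(n)]

# Five separate samples give five realized values of the one statistic.
realized = [sample_mean(draw_sample(10)) for _ in range(5)]
for value in realized:
    print(round(value, 2))
```

The printed numbers are draws from the second population: the set of values the statistic can take.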

Try to express what is bothering you using the correct terminology and see if there is really any contradiction involved.

mathman
When one calculates the mean (mathematical expectation) of a sample variance, the factor n-1 is needed so that the mean of the sample variance equals the population variance.

This. If the mean is known, you would compute the variance as follows:

$$E[\frac{\Sigma(X_i - \mu)^2}{n}] = \frac{\Sigma E[(X_i - \mu)^2]}{n} = \frac{\Sigma \sigma^2}{n} = \frac{n * \sigma^2}{n} = \sigma^2$$

If the mean is unknown, you have to estimate it with the sample mean, x-bar, and estimate the variance using the sample variance, which has a different mean as you will see:

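$$\begin{aligned} E[\Sigma(X_i - \bar{x})^2] &= E[\Sigma(X_i^2 - 2X_i\bar{x} + \bar{x}^2)] \\ &= E[\Sigma(X_i^2 - 2X_i\bar{x} + \bar{x}^2 + 2\mu X_i - 2\mu X_i + \mu^2 - \mu^2)] \\ &= E[\Sigma((X_i - \mu)^2 - (\bar{x} - \mu)^2)] \\ &= \Sigma E[(X_i - \mu)^2] - \Sigma E[(\bar{x} - \mu)^2] \end{aligned}$$

(The third equality holds after summing over i, because the X_i add up to n times x-bar.)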

so,

$$E[\Sigma(X_{i} - \bar{x})^2] = n * Var(X_i) - n * Var(\bar{x}) = n * \sigma ^2 - n * \frac{\sigma ^2}{n} = n * \sigma ^2 - \sigma ^2 = (n-1) \sigma ^2$$

To cancel the factor of n-1 and make the estimator unbiased, we use the sample variance with n-1 in the denominator, as you see in your textbook:

$$s^2 = \frac{\Sigma (X_i - \bar{x})^2}{n - 1}$$
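The result above can be checked exactly on a toy case. The sketch below assumes a hypothetical four-value population and samples of size 2 drawn with replacement (so the draws are independent), enumerates every equally likely sample, and averages both versions of the estimator. The names `population`, `sum_sq_dev`, `mean_div_n`, and `mean_div_n_minus_1` are made up for the illustration.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4]   # hypothetical population
n = 2                       # sample size

N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N   # population variance

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)

# Every equally likely sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

mean_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
mean_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(sigma2, mean_div_n, mean_div_n_minus_1)   # 5/4 5/8 5/4
```

Dividing by n gives an average of 5/8, which is (n-1)/n times the population variance of 5/4; dividing by n-1 recovers 5/4 exactly.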

The reason I think there is a contradiction is this:

Let's say for the case of the t distribution:

1.) If the sample size n is small, it has fatter tails and a larger variance. It is like taking a sample out of a large population to estimate the variance.
2.) If the sample size n is very large, it has thinner tails and a smaller variance. I think it should be very close to the variance of the population because of the very large sample size.

If I compare case 1 and case 2, case 1 has a larger variance than case 2, which translates to sample size variance is larger than the population variance. This contradicts dividing by n-1 in the denominator when calculating the sample variance.

I think I must be missing some important point here, but I cannot figure out what is wrong with this reasoning.

Thanks

Stephen Tashi
leslieg,

You are still not being specific about what population you are talking about. If we have a population P1 and we use a Student's t statistic on samples from P1 then (as I mentioned previously) this introduces a second population P2, namely the possible values of the Student's t statistic.

Your statements 1) and 2) do not imply "sample size variance is larger than the population variance". It isn't even clear what "sample size variance" means. It isn't clear what population you are talking about.

You would be correct to say that the variance of the population P2 of values of the Student's t statistic decreases as the sample size increases. The sample size has no effect whatsoever on the variance of the population P1.

The purpose of using n-1 in the quantity S that is part of the Student's t statistic is to make the average value of the estimates equal to the actual variance of P1. If you used n instead, you wouldn't change the variance of the population P1. You would change the variance of population P2 as well as change the average value of P2.
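To put numbers on the shrinking variance of P2, here is a small sketch under an extra assumption not stated above: that P1 is normally distributed, so the statistic follows a Student's t distribution with nu = n - 1 degrees of freedom. The variance of that distribution is nu / (nu - 2) when nu > 2. The function name `t_variance` is made up for the illustration.

```python
from fractions import Fraction

def t_variance(n):
    """Variance of Student's t with nu = n - 1 degrees of freedom."""
    nu = n - 1
    if nu <= 2:
        raise ValueError("the variance is finite only for n > 3")
    return Fraction(nu, nu - 2)

for n in (4, 6, 11, 31, 101):
    print(n, t_variance(n), float(t_variance(n)))
```

The values fall from 3 at n = 4 toward 1 as n grows, while nothing about P1 changes.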

I think the apparent contradiction arises because you're interpreting the Student's t distribution as the distribution of the sample variance, which it isn't. Really it's a way of incorporating the sample variance into the central limit theorem, so
$$T=\frac{(\bar{X}_n-\mu)\sqrt{n}}{S_n}$$
converges in distribution to a standard normal random variable as n tends to infinity. If you want to compare something with the sample variance you should look at the chi-square distribution instead.
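As a sketch of the chi-square connection, assume the samples come from a normal population with variance sigma^2 (an assumption added here for the illustration). Then the scaled sample variance has a chi-square distribution with n-1 degrees of freedom:

$$\frac{(n-1)S_n^2}{\sigma^2} \sim \chi^2_{n-1}$$

Since a chi-square variable with n-1 degrees of freedom has mean n-1 and variance 2(n-1), this gives

$$E[S_n^2] = \sigma^2, \qquad Var(S_n^2) = \frac{2\sigma^4}{n-1}$$

so the sample variance is centered on the population variance for every n, and only its spread shrinks as n grows.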

In both case 1 and case 2 you would use the sample variance if the mean is unknown and divide by n-1. If the mean is known, you would divide by n. It does not matter what the actual distribution is or how many samples you have, only whether you know the true value of the mean.
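A minimal sketch of that rule in Python. The function name `variance_estimate`, the data, and the "known" mean of 5 are all hypothetical.

```python
def variance_estimate(data, mu=None):
    """Estimate the population variance from a sample.

    If the true mean mu is known, divide by n.
    If it is unknown, use the sample mean and divide by n - 1.
    """
    n = len(data)
    if mu is not None:
        return sum((x - mu) ** 2 for x in data) / n
    if n < 2:
        raise ValueError("need at least two values when the mean is unknown")
    xbar = sum(data) / n
    return sum((x - xbar) ** 2 for x in data) / (n - 1)

data = [2, 4, 4, 6]
print(variance_estimate(data, mu=5))  # known mean: 12 / 4 = 3.0
print(variance_estimate(data))        # unknown mean: 8 / 3
```

The choice of divisor depends only on whether `mu` is supplied, not on the sample size or the shape of the distribution.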

Stephen Tashi
leslieg,

Another way that you are getting tangled up in words is that you are not making a distinction between "changing an expression in a formula" and "changing the number of samples". If we use (n-1) instead of n in a formula for computing a statistic, this does not mean that we changed the number of samples.

If we are precise about the apparent contradiction you raise, I think it can be phrased this way:

In computing the sample variance for a sample of n things from population P1, with a statistic S, we use (n-1) in the denominator for S because using (n) instead would cause the population P2 of the values of S to have a mean less than the variance of P1. However, if we increase the number of samples n then we decrease the variance of the population P2.

I see no contradictions in the above statements.
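Both statements can be checked together on a toy case. The sketch below assumes a hypothetical four-value population P1 and samples drawn with replacement, enumerates every equally likely sample for n = 2, 3, 4, and computes the exact mean and variance of the resulting population P2 of sample variances. The names `s2` and `mean_and_variance_of_s2` are made up for the illustration.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4]   # hypothetical P1, with variance 5/4

def s2(sample):
    """Sample variance with n - 1 in the denominator."""
    n = len(sample)
    xbar = Fraction(sum(sample), n)
    return sum((x - xbar) ** 2 for x in sample) / (n - 1)

def mean_and_variance_of_s2(n):
    """Exact mean and variance of P2 for samples of size n."""
    values = [s2(s) for s in product(population, repeat=n)]
    m = sum(values) / len(values)
    v = sum((val - m) ** 2 for val in values) / len(values)
    return m, v

results = {n: mean_and_variance_of_s2(n) for n in (2, 3, 4)}
for n, (m, v) in results.items():
    print(n, m, v)
```

The mean of P2 stays at the variance of P1 for every n, while the variance of P2 gets smaller as n increases.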