Stephen Tashi
Oct27-11, 11:07 AM
The mean and variance of a sample have standard definitions, given by formulas. The field of study that states these definitions is "descriptive statistics". If you give numbers for the mean and variance of a sample, people will assume you used these formulas, or formulas that give exactly the same numerical answers. It's merely a matter of obeying standard conventions.

When you want to use the numbers in a sample to estimate the mean and variance of a population (or a "random variable") there are no set rules for what formula you can use. What you do will depend on what you know about the distribution of the population.

There are three different concepts involved:
1) The properties of the sample (such as its mean and variance)
2) The properties of the population (such as its mean and variance)
3) The formulae or procedures that you apply to the data in the sample to estimate the properties of the population.

For example, suppose the population is defined by a random variable X that has a discrete distribution with two unknown parameters M and A. Suppose we know that X has only 3 values with non-zero probabilities and that these are given by:

probability that X = M + A is 1/3
probability that X = M - A is 1/3
probability that X = M is 1/3

Suppose we take a sample of 4 random draws from this distribution and the results are:
{ -3, 1, 5, 5 }. The sample contains three distinct values, so they must be M - A, M and M + A. Then we know "by inspection" that M = 1 and A = 4 (taking A to be positive). The mean of the population is therefore 1. (There is a standard definition for the mean of a distribution and if you apply it to the above list of probabilities, using M = 1 and A = 4, you get that the mean is 1.)
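
Written out with the three probabilities listed above, the standard definition of the mean of a discrete distribution gives:

[tex] E[X] = \frac{1}{3}(M + A) + \frac{1}{3}(M - A) + \frac{1}{3}M = M [/tex]

so with M = 1 the population mean is 1, whatever the value of A. (The notation E[X] for the mean of the distribution is my own shorthand here.)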

However, if you state that you have computed the mean of the sample, this tells people that you are stating the number (-3 + 1 + 5 + 5)/4 = 2. You aren't supposed to say that the mean of the sample is 1 even though you know that the sample implies that the mean of the population is 1.
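
To make the distinction concrete, here is a minimal Python sketch that computes both numbers for this example. The variable names are mine, and the values M = 1 and A = 4 are the ones read off the sample by inspection; treat it as an illustration rather than a general estimation procedure.

```python
from fractions import Fraction
from statistics import mean

# The four draws from the example.
sample = [-3, 1, 5, 5]

# Property of the sample: the ordinary arithmetic mean of the data.
sample_mean = mean(sample)

# Property of the population: apply the definition of the mean of a
# discrete distribution, using M = 1 and A = 4 found by inspection.
M, A = 1, 4
distribution = {
    M + A: Fraction(1, 3),
    M - A: Fraction(1, 3),
    M: Fraction(1, 3),
}
population_mean = sum(value * prob for value, prob in distribution.items())

print(sample_mean)      # 2
print(population_mean)  # 1
```

The two results differ, and only the first one may be reported as "the mean of the sample".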

Suppose you have a sample of N values of the random variable X and let the sample mean be [itex] \bar x [/itex]. I'm not an expert in descriptive statistics, but I think that if you state a number for the sample variance, it is always supposed to be the number:

[tex] \frac {\sum (x - \bar x)^2}{N} [/tex]

and not the number:

[tex] \frac {\sum (x - \bar x)^2}{N-1} [/tex]

If you are estimating the variance of the population, you are free to use the latter formula and people advocate doing this when N is "small". To understand why, you have to study the statistical theory of "estimators".
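
As a rough illustration, the sketch below evaluates both formulas on the sample { -3, 1, 5, 5 } and then, assuming M = 1 and A = 4, lists every equally likely sample of size 2 from the three-point distribution to see what each formula gives on average. The function and variable names are my own, and the choice of sample size 2 is only to keep the enumeration short.

```python
from fractions import Fraction
from itertools import product

def sum_sq_dev(xs):
    """Sum of squared deviations of xs from its own mean (exact arithmetic)."""
    xbar = Fraction(sum(xs), len(xs))
    return sum((x - xbar) ** 2 for x in xs)

# Both formulas applied to the sample from the example.
sample = [-3, 1, 5, 5]
N = len(sample)
var_div_n = sum_sq_dev(sample) / N                 # divide by N
var_div_n_minus_1 = sum_sq_dev(sample) / (N - 1)   # divide by N - 1

# Variance of the population, assuming M = 1 and A = 4, so the three
# equally likely values are 5, -3 and 1.
values = [5, -3, 1]
pop_mean = Fraction(sum(values), len(values))
pop_var = sum((v - pop_mean) ** 2 for v in values) / len(values)

# All 9 equally likely ordered samples of size 2 from the population,
# and the average of each formula over those samples.
n = 2
samples = list(product(values, repeat=n))
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(var_div_n, var_div_n_minus_1)           # 11 44/3
print(pop_var, avg_div_n, avg_div_n_minus_1)  # 32/3 16/3 32/3
```

In this small case, dividing by N - 1 gives the population variance on average, while dividing by N gives a smaller number on average. That is the kind of property the theory of estimators studies. In Python's standard library, `statistics.pvariance` divides by N and `statistics.variance` divides by N - 1.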


[quote]
So if I want to pool 5 groups together and calculate the total mean and std, I can either use the equation above on all data points Ʃni, or I can use the equation for pooled std and the mean will be

x_bar = Ʃ(xi_bar*ni)/(Ʃni-1), right?
[/quote]
No. You wouldn't divide by [itex] \sum n_i - 1 [/itex]. Divide by [itex] \sum n_i [/itex], so that the mean of the pooled data is [itex] \bar x = \frac{\sum n_i \bar x_i}{\sum n_i} [/itex].
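
A quick way to convince yourself is to check it on some numbers. The sketch below uses five made-up groups (hypothetical data, chosen only for illustration) and compares the size-weighted average of the group means with the mean computed directly from all the data points.

```python
from fractions import Fraction

# Hypothetical data: five groups of unequal size, invented for this check.
groups = [
    [2, 4, 6],
    [1, 3],
    [5, 5, 5, 9],
    [7],
    [0, 8],
]

def mean(values):
    """Exact arithmetic mean of a list of numbers."""
    return Fraction(sum(values), len(values))

sizes = [len(g) for g in groups]
group_means = [mean(g) for g in groups]

# Weight each group mean by its group size.
weighted_sum = sum(n * m for n, m in zip(sizes, group_means))

# Mean of the pooled data: divide by the total number of points.
pooled_mean = weighted_sum / sum(sizes)

# The same mean computed directly from every data point.
all_points = [x for g in groups for x in g]
direct_mean = mean(all_points)

# Dividing by (total - 1) instead gives a different number.
wrong_mean = weighted_sum / (sum(sizes) - 1)

print(pooled_mean, direct_mean, wrong_mean)  # 55/12 55/12 5
```

Dividing by the total count reproduces the mean of all the data exactly; dividing by one less than the total does not.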