Question about variance, population and sample

  • Thread starter robert Ihnot
  • Tags
    population
In summary, a reader writing in the Chat Room section of the October edition of Active Trader magazine asks why the variance example in the previous month's issue divided by three rather than by two. The magazine's answer amounted to "that's how it is done," with no further reasoning. The thread starter notes that elementary statistics books explain the distinction between sample deviation and population deviation poorly, saying only that it eliminates bias or makes the theory work better, and asks why, for a sample of three, it is better to divide by 2 than by 3. He tries examples with dice to understand the concept. One reply states that dividing by n and dividing by n-1 both give unbiased estimators, but a later reply points out that they cannot both be unbiased unless the variance is zero.
  • #1
robert Ihnot
In the October edition of the magazine Active Trader, a reader writing in Chat Room, under "Deviating from deviation?", asks why last month's explanation of the variance did not divide by two in its example:

{(8-9)^2 + (9-9)^2 +(10-9)^2}/3 = .667.

The explanation given is nothing more than "That's how it is done," and, adding "We're not math majors," it completely ignores the difference between the sample deviation and the population deviation. (There is no explanation of where the above example comes from; probably it is nothing but an equation invented by the writers.)

Elementary statistics books do a very poor job of explaining WHY that difference occurs, saying things such as "It eliminates bias," or even "It makes the theory work out better, and isn't worth going into."

Does anyone have a good explanation of why there is that distinction, and assuming it is a sample deviation, why is it better to divide by 2 than by 3?
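
As a minimal sketch (not from the magazine or from any reply in this thread), the two candidate calculations for the 8, 9, 10 example can be compared with Python's standard `statistics` module. The variable names are illustrative only.

```python
from fractions import Fraction
import statistics

# The three values from the magazine example, kept as exact fractions.
data = [Fraction(8), Fraction(9), Fraction(10)]

# Divide by n = 3: treats the three values as the whole population.
divide_by_n = statistics.pvariance(data)

# Divide by n - 1 = 2: treats the three values as a sample.
divide_by_n_minus_1 = statistics.variance(data)

print(divide_by_n)          # 2/3, the magazine's .667
print(divide_by_n_minus_1)  # 1
```

The numerator is the same in both cases; only the divisor changes, which is the whole of the reader's question.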
 
  • #2
robert Ihnot said:
Does anyone have a good explanation of why there is that distinction, and assuming it is a sample deviation, why is it better to divide by 2 than by 3?
In the estimators section of your statistics text you should find, either as a problem or as an example, a simple calculation showing that the "divide by n" estimator of the population variance, computed from a sample, is biased, while the "divide by n-1" estimator is unbiased. Are you looking for an intuitive answer?
 
  • #3
Well, I have made up several examples about dice, but when the number of trials falls, say three throws of the dice, this greatly changes the variance.

This is my example: the population is the six sides of a die, the mean is 3.5 on a throw, and the variance is 2.92. Now suppose we throw three times and get a perfectly reasonable outcome: 2, 3, 4. The mean is 3, and dividing by 2 the variance is 1, whereas dividing by 3 it would have been 2/3. In neither case are we near 2.92. Thanks, bob
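
The figures in this dice example can be reproduced with a short sketch, assuming a fair six-sided die. The variable names are illustrative and exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
import statistics

# Population: the six equally likely faces of a fair die.
faces = [Fraction(f) for f in range(1, 7)]
population_mean = statistics.mean(faces)           # 7/2
population_variance = statistics.pvariance(faces)  # 35/12, about 2.92

# One particular sample of three throws.
sample = [Fraction(2), Fraction(3), Fraction(4)]
sample_mean = statistics.mean(sample)              # 3
divide_by_2 = statistics.variance(sample)          # 1
divide_by_3 = statistics.pvariance(sample)         # 2/3

print(population_mean, population_variance)
print(sample_mean, divide_by_2, divide_by_3)
```

A single sample like this says little about either divisor; the comparison only becomes meaningful when averaged over many samples.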
 
  • #4
robert Ihnot said:
Well, I have made up several examples about dice, but when the number of trials falls, say three throws of the dice, this greatly changes the variance.
No, not "on average." Your example is conditional on a given sample. That's not a good basis to verify the expected value of any variance estimator.
 
  • #5
robert Ihnot said:
This is my example: the population is the six sides of a die, the mean is 3.5 on a throw, and the variance is 2.92. Now suppose we throw three times and get a perfectly reasonable outcome: 2, 3, 4. The mean is 3, and dividing by 2 the variance is 1, whereas dividing by 3 it would have been 2/3. In neither case are we near 2.92. Thanks, bob

Run this experiment a million times, and look at the average value for the variance that you compute.
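
One way to try this suggestion is a small simulation. The sketch below is an assumption-laden illustration, not anything posted in the thread: it uses a fair die, a fixed random seed, and 200,000 repetitions instead of a million to keep the run short. The function name `average_estimates` is hypothetical.

```python
import random

def average_estimates(trials, rolls=3, seed=0):
    """Repeat the 'roll a die `rolls` times' experiment `trials` times.

    Returns the average divide-by-n estimate and the average
    divide-by-(n-1) estimate of the variance.
    """
    rng = random.Random(seed)
    total_n = 0.0
    total_n_minus_1 = 0.0
    for _ in range(trials):
        sample = [rng.randint(1, 6) for _ in range(rolls)]
        mean = sum(sample) / rolls
        sum_sq = sum((x - mean) ** 2 for x in sample)
        total_n += sum_sq / rolls
        total_n_minus_1 += sum_sq / (rolls - 1)
    return total_n / trials, total_n_minus_1 / trials

avg_n, avg_n_minus_1 = average_estimates(200_000)
print(avg_n)          # should land near 35/18, about 1.94
print(avg_n_minus_1)  # should land near 35/12, about 2.92
```

Note that the number of repetitions is not the n in the divisor: every repetition still uses a sample of three throws.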
 
  • #6
Hurkyl said:
Run this experiment a million times, and look at the average value for the variance that you compute.

If it is so run, there will not be much difference between dividing by 1,000,000 or 999,999.
 
  • #7
robert Ihnot said:
If it is so run, there will not be much difference between dividing by 1,000,000 or 999,999.
Correct, both "1/n" and "1/(n-1)" are unbiased estimators.
 
  • #8
robert Ihnot said:
If it is so run, there will not be much difference between dividing by 1,000,000 or 999,999.

You entirely misunderstand:

You described an experiment where you roll a die three times, and then compute two different estimates for the variance, one where you divide by 3, and one where you divide by 2.

Now, you perform that experiment a million times, and you get a million estimates where you divided by 3, and a million estimates where you divided by 2.

You can then find the average of the divide by 3 estimates, and the average of the divide by 2 estimates. One of them will be (very close to) the actual variance. One will not.
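
Because three throws of a die have only 216 equally likely outcomes, the two averages described above can also be computed exactly rather than simulated. The following is a sketch under the assumption of a fair die; the helper name `sum_sq` is illustrative.

```python
from fractions import Fraction
from itertools import product

# All 216 equally likely results of rolling a fair die three times.
outcomes = list(product(range(1, 7), repeat=3))

def sum_sq(sample):
    """Sum of squared deviations of the sample from its own mean."""
    mean = Fraction(sum(sample), len(sample))
    return sum((x - mean) ** 2 for x in sample)

# Average each estimator over every possible outcome.
avg_divide_by_3 = sum(sum_sq(o) / 3 for o in outcomes) / len(outcomes)
avg_divide_by_2 = sum(sum_sq(o) / 2 for o in outcomes) / len(outcomes)

print(avg_divide_by_3)  # 35/18, about 1.94
print(avg_divide_by_2)  # 35/12, the actual variance of one throw
```

The divide-by-2 average matches the population variance of 35/12 exactly, while the divide-by-3 average falls short by a factor of 2/3.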


Correct, both "1/n" and "1/(n-1)" are unbiased estimators.

They cannot possibly both be unbiased, unless the variance is zero.

If s/n is, on average, the variance v (where s is the calculation in the numerator), and so is s/(n-1), then we have:

[tex]E(s) = nv[/tex]
[tex]E(s) = (n-1)v[/tex]
[tex]0 = v[/tex]

:tongue2:
 
  • #9
Hurkyl, you are correct. My bad. What I meant was, although the "1/n" estimator is biased, it is consistent.
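
The shrinking bias behind this remark can be illustrated with a sketch that, assuming a fair die, computes the exact expected value of the divide-by-n estimate for a few small sample sizes. The function name `expected_divide_by_n` is hypothetical. The sketch shows only that the bias falls as n grows; consistency also requires the spread of the estimator to shrink, which this code does not check.

```python
from fractions import Fraction
from itertools import product

POPULATION_VARIANCE = Fraction(35, 12)  # variance of one fair die throw

def expected_divide_by_n(n):
    """Exact average of the divide-by-n estimate over all 6**n outcomes."""
    total = Fraction(0)
    for outcome in product(range(1, 7), repeat=n):
        mean = Fraction(sum(outcome), n)
        total += sum((x - mean) ** 2 for x in outcome) / n
    return total / 6 ** n

for n in range(2, 6):
    expected = expected_divide_by_n(n)
    bias = POPULATION_VARIANCE - expected
    # The shortfall works out to POPULATION_VARIANCE / n for each n.
    print(n, expected, bias)
```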
 
  • #10
Well, I got an answer here in this statistics book, "Principles of Statistics," M. G. Bulmer, Dover paperback, 1967, p. 130:

"It may seem surprising that the Expected value of the sample variance is slightly less than the population variance. The reason is that the sum of the squared deviations of a set of observations from their mean is always less than the sum of the squared deviations from the population mean."
 
  • #11
Good work; now you can give advice to the needy in these forums. :smile:
 
  • #12
It may seem surprising that the Expected value of the sample variance is slightly less than the population variance.

If we look at [tex]F(x)=\sum_{i=1}^{i=n}(a_i-x)^2[/tex]

Taking the derivative and setting it equal to 0, we find where the function attains its minimum: [tex]nx=\sum_{i=1}^{i=n}a_i[/tex]

Thus letting x take on the value of the mean of the sample gives the minimal value of the sum of squared deviations, and hence of the variance computed from it.
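
As a small numerical check of this minimization (a sketch using the 2, 3, 4 dice sample from earlier in the thread; the grid of candidate values is an arbitrary choice for illustration), the sum of squared deviations can be evaluated about several centres:

```python
from fractions import Fraction

def F(x, sample):
    """Sum of squared deviations of the sample values from the point x."""
    return sum((a - x) ** 2 for a in sample)

sample = [2, 3, 4]
sample_mean = Fraction(sum(sample), len(sample))  # 3
population_mean = Fraction(7, 2)                  # mean of a fair die

about_sample_mean = F(sample_mean, sample)          # 2
about_population_mean = F(population_mean, sample)  # 11/4

# Scan candidate centres from 1.0 to 6.0 in steps of 0.1.
candidates = [Fraction(k, 10) for k in range(10, 61)]
best = min(candidates, key=lambda x: F(x, sample))

print(about_sample_mean, about_population_mean, best)
```

Measuring from the sample mean gives 2, while measuring from the population mean gives 11/4, so the sample-mean version understates the spread for this sample.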
 
  • #13
It may seem surprising that the Expected value of the sample variance is slightly less than the population variance.

On pages 129-130 of Principles of Statistics this problem is gone into, and a few additional details are presented here. He writes:

[tex]S^2= \sum(x_i-X)^2=\sum(x_i-\mu)^2-N(X-\mu)^2[/tex]
Here S^2 is as defined above, N is the number of observations in the sample, mu is the population mean, X is the sample mean, and each x_i is a variable that takes on the various sample values.

Now the point is to find the expectation, E. We have:
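[tex]E\left(\sum(x_i-\mu)^2\right) = N\sigma^2[/tex], where sigma is the standard deviation.

For the second term, E(X)=mu, so we have [tex]N\,E(X-\mu)^2=N\left(E(X^2)-(E(X))^2\right)=N\,V(X)[/tex]
The term after N is the variance of the sample mean. For independent observations, [tex]V(X)=\frac{V(NX)}{N^2}=\frac{\sum V(x_i)}{N^2}=\frac{N\sigma^2}{N^2}=\frac{\sigma^2}{N}[/tex], so the second term has expectation [tex]\sigma^2[/tex].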

Thus returning to the original equation we have:
[tex]E(S^2)=N\sigma^2-\sigma^2=(N-1)\sigma^2[/tex].

Author adds: "Because of this fact S^2 is often divided by N-1 instead of N in order to obtain an unbiased estimate..."
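
The identity at the start of this derivation, and the resulting expectation, can be checked exactly for the dice example. This is a sketch assuming a fair die and samples of N = 3 throws; the names `lhs` and `rhs` are illustrative and simply denote the two sides of the identity.

```python
from fractions import Fraction
from itertools import product

MU = Fraction(7, 2)          # population mean of a fair die
SIGMA_SQ = Fraction(35, 12)  # population variance of a fair die
N = 3                        # sample size

def lhs(sample):
    """Sum of squared deviations from the sample mean X."""
    X = Fraction(sum(sample), len(sample))
    return sum((x - X) ** 2 for x in sample)

def rhs(sample):
    """Sum of squared deviations from mu, minus N * (X - mu)^2."""
    X = Fraction(sum(sample), len(sample))
    return sum((x - MU) ** 2 for x in sample) - len(sample) * (X - MU) ** 2

outcomes = list(product(range(1, 7), repeat=N))

# The identity should hold for every possible sample, not just on average.
identity_holds = all(lhs(o) == rhs(o) for o in outcomes)

# Averaging S^2 over all equally likely samples gives its expectation.
expected_S2 = sum(lhs(o) for o in outcomes) / len(outcomes)

print(identity_holds)
print(expected_S2, (N - 1) * SIGMA_SQ)
```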
 

1) What is variance?

Variance is a statistical measure that describes the spread of data points around the mean or average of a data set. It is calculated by taking the average of the squared differences between each data point and the mean.

2) What is the difference between population and sample variance?

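Population variance is a measure of the variability of a population, while sample variance is a measure of the variability of a sample taken from a population. Population variance is calculated using all of the data points in a population, dividing the sum of squared deviations from the population mean by N. Sample variance is calculated using a subset of the data, dividing the sum of squared deviations from the sample mean by n-1 so that, on average, it equals the population variance.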

3) How is variance related to standard deviation?

Variance and standard deviation are both measures of the spread of data, with variance being the square of the standard deviation. Standard deviation is often preferred over variance because it is in the same units as the data, making it easier to interpret.
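
As a brief illustration of this relationship (a sketch using the 8, 9, 10 values from the thread and Python's standard `statistics` module):

```python
import math
import statistics

data = [8, 9, 10]

# Population versions: divide by n.
population_variance = statistics.pvariance(data)
population_std = statistics.pstdev(data)

# Sample versions: divide by n - 1.
sample_variance = statistics.variance(data)
sample_std = statistics.stdev(data)

# In both cases the standard deviation is the square root of the variance.
print(math.isclose(population_std ** 2, population_variance))
print(math.isclose(sample_std ** 2, sample_variance))
```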

4) Can variance be negative?

No, variance cannot be negative. It is always a non-negative value, as it is calculated by averaging the squared differences between data points and the mean. A variance of zero indicates that all the data points are identical; any spread at all gives a positive value.

5) How can variance be used in data analysis?

Variance is a useful measure in data analysis as it provides insight into the variability of a data set. It can help identify outliers and understand the spread of data points, which can inform decision making and further analysis. It is also used in many statistical tests and calculations to determine the significance of results.
