Comparing Sample Means with Pooled Variance Estimate

In hypothesis testing for comparing two sample means with unknown but equal variances, the pooled variance estimate weights each sample variance by (n1-1) and (n2-1) and divides by (n1+n2-2) so that the estimate is unbiased. Dividing by N instead of N-1 systematically underestimates the population variance; the N-1 adjustment corrects this bias, ensuring that the expected value of the sample variance equals the true population variance. This correction matters for accurate statistical inference, which is why the pooled variance formula with (n1-1) and (n2-1) is the standard choice in these scenarios.
adamg
Suppose you are conducting a hypothesis test to compare two sample means from independent samples, with the variance unknown, but you know it is the same for both populations. Then you use the pooled estimate of the variance given by

\[ \frac{(n_1 - 1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2} \]

I was just wondering why we use (n1-1) etc instead of using n1 and n2 and then dividing by n1 + n2?

thanks
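
For concreteness, here is a minimal Python/NumPy sketch of that pooled estimate; the sample arrays x1 and x2 are made-up values used purely for illustration:

```python
import numpy as np

# Two made-up independent samples (values are purely illustrative)
x1 = np.array([5.1, 4.9, 5.4, 5.0, 5.2, 4.8])
x2 = np.array([5.6, 5.3, 5.8, 5.5])

n1, n2 = len(x1), len(x2)
s1_sq = x1.var(ddof=1)  # sample variance with the (n-1) denominator
s2_sq = x2.var(ddof=1)

# Pooled estimate of the common variance
pooled = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
print(pooled)
```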
 
adamg said:
I was just wondering why we use (n1-1) etc instead of using n1 and n2 and then dividing by n1 + n2?

When you calculate the sample variance using "(sum of squared differences from the mean)/N", it turns out that this gives a biased estimate of the population variance (and its square root a biased estimate of the population standard deviation). Replacing "N" with "N-1" gives an unbiased estimate of the population variance (though its square root is still a slightly biased estimate of the standard deviation), so the "N-1" form is usually preferred. Unfortunately there is often a bit of ambiguity whenever sample variance and standard deviation are discussed, as there doesn't seem to be a universal standard for whether to use "N" or "N-1" in the definition.

In your example above I assume that s1^2 and s2^2 are based on the "N-1" calculations.
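
To see the bias concretely, here is a minimal Monte Carlo sketch in Python/NumPy; the population variance, sample size, and trial count are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_sq = 4.0    # true population variance (arbitrary)
n = 5             # small sample size makes the bias easy to see
trials = 200_000

samples = rng.normal(0.0, np.sqrt(sigma_sq), size=(trials, n))

# Average each estimator over many samples to approximate its expectation
mean_var_n = samples.var(axis=1, ddof=0).mean()   # divide by N   (biased)
mean_var_n1 = samples.var(axis=1, ddof=1).mean()  # divide by N-1 (unbiased)

print(mean_var_n)   # close to (n-1)/n * sigma^2 = 3.2
print(mean_var_n1)  # close to sigma^2 = 4.0
```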
 
Here is the above in a bit more detail:

\[ s_n^2 = \frac{1}{n} \sum (x_i-\bar{x})^2 \]

\[ = \frac{1}{n} \sum \left[ \left( (x_i-\mu) - (\bar{x}-\mu) \right)^2 \right] \]

\[ = \frac{1}{n} \sum \left[ (x_i-\mu)^2 - 2 (x_i-\mu)(\bar{x}-\mu) + (\bar{x}-\mu)^2 \right] \]

Since \sum (x_i-\mu) = n(\bar{x}-\mu), the middle term sums to -2(\bar{x}-\mu)^2, which combines with the last term to give

\[ = \frac{1}{n} \sum (x_i-\mu)^2 - (\bar{x}-\mu)^2 \]

So,

\[ E[s_n^2] = \frac{1}{n} \sum E[(x_i-\mu)^2] - E[(\bar{x}-\mu)^2] \]

\[ = E[(x-\mu)^2] - E[(\bar{x}-\mu)^2] \]

\[ = \sigma^2 - \{\rm{term\ greater\ than\ or\ equal\ to\ zero}\} \]

This shows that, in expectation, the sample variance s_n^2 underestimates the population variance \sigma^2.

You can further show (assuming all the samples are independent) that E[(\bar{x}-\mu)^2] is equal to \sigma^2/n, and hence

\[ E[s_n^2] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n} \sigma^2 \]

So not only is s_n^2 a biased estimator of \sigma^2, its expectation is too small by a factor of precisely (n-1)/n. Using (n-1) instead of n in the denominator fixes this and makes the expectation of the modified sample variance, E[s_{n-1}^2], equal to the population variance \sigma^2.
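
Here is a quick numerical check of both facts, a minimal sketch assuming normally distributed samples (the normal distribution is just a convenient choice; the identities hold for any population with finite variance):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma_sq = 10.0, 4.0  # arbitrary population parameters
n = 5
trials = 200_000

samples = rng.normal(mu, np.sqrt(sigma_sq), size=(trials, n))
xbar = samples.mean(axis=1)

# E[(xbar - mu)^2] should be close to sigma^2 / n = 0.8
print(((xbar - mu) ** 2).mean())

# E[s_n^2] / sigma^2 should be close to (n-1)/n = 0.8
print(samples.var(axis=1, ddof=0).mean() / sigma_sq)
```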
 
