Variance of the union of two samples

  • Thread starter: aydos
  • Tags: Union, Variance
This seems like a simple problem, but I cannot find an answer. Imagine I have two bags of samples with estimated means \hat{x} and \hat{y} and estimated variances \hat{\sigma_x} and \hat{\sigma_y}. The bags contain n and m samples, respectively. Now assume I mix all the samples in a single bag.

Question: How do I estimate the variance of all the samples in this larger bag using only the statistics above?
 
For the sum of the two samples the variance is given by:

E[(x+y-\hat{x}-\hat{y})(x+y-\hat{x}-\hat{y})]
=E[(x-\hat{x})^2]+2E[(x-\hat{x})(y-\hat{y})]+E[(y-\hat{y})^2]

The standard deviation is the square root of the variance.

If the samples are independent then:

2E[(x-\hat{x})(y-\hat{y})]=0
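
A quick numerical check of the expansion above, as a minimal sketch. The data values are made up for illustration and the helper names `mean`, `cov` and `var` are not from the thread. Note that this is the variance of the paired sums x_i + y_i, which needs two equally sized samples; it is not the variance of the mixed bag.

```python
def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Sample covariance (n - 1 denominator) of two equally long lists."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)

def var(a):
    """Sample variance, i.e. the covariance of a list with itself."""
    return cov(a, a)

# Hypothetical paired data, chosen only for illustration.
x = [1.0, 4.0, 2.0, 7.0]
y = [3.0, 3.5, 8.0, 1.0]
sums = [u + v for u, v in zip(x, y)]

lhs = var(sums)                          # variance of the sum
rhs = var(x) + 2 * cov(x, y) + var(y)    # expanded form
print(abs(lhs - rhs) < 1e-9)
```

The cross term 2 * cov(x, y) is zero only in expectation for independent samples; for any particular pair of finite samples it will usually be small but nonzero.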
 
Well, let's assume that both samples are from the same distribution. Let z be the combined sample. An estimator of the variance of the distribution based on the combined sample is:
\hat{\sigma}^2_z = \frac{1}{n+m-1} \sum_z (z-\hat{z})^2
= \frac{1}{n+m-1} \left( \sum_x (x-\hat{z})^2 + \sum_y (y-\hat{z})^2 \right)
Now, this isn't quite in the form we want because of the \hat{z}. However, \hat{z} = (n\hat{x} + m\hat{y})/(n+m). So (I'll just do the x's):
\sum_x (x-\hat{z})^2 = \sum_x (x - (n\hat{x} + m\hat{y})/(n+m))^2
= \sum_x (x - \hat{x} + \hat{x} - (n\hat{x} + m\hat{y})/(n+m))^2
= \sum_x (x - \hat{x} + c)^2 where c = \hat{x} - (n\hat{x} + m\hat{y})/(n+m) = m(\hat{x} - \hat{y})/(n+m)
= \sum_x ((x - \hat{x})^2 + 2c(x - \hat{x}) + c^2)
= (n-1)\hat{\sigma}^2_x + 0 + nc^2
The middle term vanishes because \sum_x (x - \hat{x}) = 0.
You can do the same thing for y, and combine them.
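
Carrying the same steps through for y and adding the two pieces, the correction terms collapse to nm/(n+m) (\hat{x} - \hat{y})^2, so the whole thing should come out as \hat{\sigma}^2_z = ((n-1)\hat{\sigma}^2_x + (m-1)\hat{\sigma}^2_y + nm/(n+m) (\hat{x} - \hat{y})^2)/(n+m-1). Here is a minimal Python sketch that checks this against a direct computation. The function name `combined_variance` and the data are my own choices for illustration, and it assumes the input variances use the n - 1 denominator.

```python
import statistics

def combined_variance(n, mean_x, var_x, m, mean_y, var_y):
    """Sample variance of the merged bag, computed from per-bag summaries.

    var_x and var_y are assumed to be sample variances (n - 1 denominator).
    """
    diff = mean_x - mean_y
    # Within-bag sums of squares plus the between-bag correction
    # n*c_x^2 + m*c_y^2, which simplifies to n*m/(n+m) * diff^2.
    total = (n - 1) * var_x + (m - 1) * var_y + n * m / (n + m) * diff ** 2
    return total / (n + m - 1)

# Hypothetical data, used only to compare against the direct calculation.
x = [2.0, 3.5, 5.0, 4.0]
y = [10.0, 12.0, 11.5]

from_summaries = combined_variance(
    len(x), statistics.mean(x), statistics.variance(x),
    len(y), statistics.mean(y), statistics.variance(y),
)
direct = statistics.variance(x + y)  # variance of the mixed bag itself
print(from_summaries, direct)
```

The two printed numbers should agree up to floating-point rounding, since the formula is an algebraic identity and not an approximation.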
 
Do you mean that you are pooling the two samples, one with sample size m, the other with sample size n, and you want to use the information to estimate a common population variance? If so, read on.

Are \hat \sigma_x, \hat \sigma_y estimates of the variances or the standard deviations? Your notation is a little odd.

If you have estimates of variance, and are simply pooling (combining) the two samples, you can use the classical pooled variance (with your notation)

\hat \sigma = \frac{(n-1)\hat \sigma_x + (m-1) \hat \sigma_y}{n+m-2}

If your estimates are for the standard deviations rather than the variance, the appropriate calculation would be

\hat \sigma^2 = \frac{(n-1) \hat \sigma_x^2 + (m-1) \hat \sigma_y^2}{n+m-2}

Notice that this is an entirely different question from the one the others answered: they took you to be asking "How can I use the summary information in these two samples to determine the variance I would get by combining all data into one large sample?"
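
For reference, the pooled variance above is easy to compute from the summaries alone. A minimal sketch, assuming the inputs are sample variances (not standard deviations); the function name `pooled_variance` and the numbers are illustrative only.

```python
def pooled_variance(n, var_x, m, var_y):
    """Classical pooled variance from two sample variances.

    Each variance is weighted by its degrees of freedom (n - 1 and m - 1).
    The sample means do not enter the formula at all.
    """
    return ((n - 1) * var_x + (m - 1) * var_y) / (n + m - 2)

# Hypothetical summaries: 5 samples with variance 4, 3 samples with variance 10.
pooled = pooled_variance(5, 4.0, 3, 10.0)
print(pooled)  # (4*4 + 2*10) / 6 = 6.0
```

If what you have are standard deviations, square them before passing them in.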
 
statdad said:
Do you mean that you are pooling the two samples, one with sample size m, the other with sample size n, and you want to use the information to estimate a common population variance? If so, read on.

Notice that this is an entirely different question from the one the others answered: they took you to be asking "How can I use the summary information in these two samples to determine the variance I would get by combining all data into one large sample?"
The difference between the pooled variance (your method) and the equation I gave is that the pooled variance assumes the two samples actually come from different populations, with different population means but the same population variance. The equation I gave assumes that the samples are from the same population, with the same population mean and variance.
 
I realize that: I was unsure of the OP's intent in the question.
 
Thanks for the replies and I am sorry for the confusion in notation, I meant to write \hat{\sigma}^2_x instead of \hat{\sigma}_x.

Both replies from mXSCNT and statdad were very useful. I did not realize the solutions are different for different assumptions about the populations. I still haven't fully understood why the solutions are different though?

The two problems I am considering have the following assumptions:

1- the population means are different and the variances are the same
2- both the population means and the variances are different
 
Well, suppose that the x's are {1, 2}, and the y's are {101, 102}.

If x and y were taken from populations with different means, then you'd expect the variance of those populations to be quite small (since x and y each have variances of only 1/2). On the other hand if x and y were taken from populations with the same mean, then you'd expect the variance of that population to be comparatively large, since x and y are separated by about 100 units.
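
To put numbers on this example, here is a small Python sketch that computes both estimates for these four values. The variable names are mine; the two formulas are the pooled variance and the variance of the merged sample discussed earlier in the thread.

```python
import statistics

x = [1.0, 2.0]
y = [101.0, 102.0]
n, m = len(x), len(y)

var_x = statistics.variance(x)  # sample variance of each bag
var_y = statistics.variance(y)

# Different population means, common variance: pool the within-bag variances.
pooled = ((n - 1) * var_x + (m - 1) * var_y) / (n + m - 2)

# Same population: treat all four values as one sample.
combined = statistics.variance(x + y)

print(pooled, combined)
```

The pooled estimate is 0.5, while the combined estimate is 10001/3, roughly 3333.67. Almost all of the combined variance comes from the gap of about 100 between the two bag means, which the pooled estimate ignores by design.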
 