# Variance of the union of two samples

1. Jul 17, 2009

### aydos

This seems like a simple problem but I cannot find an answer. Imagine I have 2 bags of samples with estimated means $\hat{x}$ and $\hat{y}$ and estimated variances $\hat{\sigma_x}$ and $\hat{\sigma_y}$. The bags contain $n$ and $m$ samples respectively. Now assume I mix all the samples in a single bag.

Question: How do I estimate the variance of all the samples in this larger bag only using the statistics above?

2. Jul 17, 2009

### John Creighto

For the sum of the two samples, the variance is given by:

$$E[(x+y-\hat{x}-\hat{y})(x+y-\hat{x}-\hat{y})]$$
$$=E[(x-\hat{x})^2]+2E[(x-\hat{x})(y-\hat{y})]+E[(y-\hat{y})^2]$$

The standard deviation is the square root of the variance.

If the samples are independent then:

$$2E[(x-\hat{x})(y-\hat{y})]=0$$
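As a rough check of the expansion above, here is a minimal Python sketch. It assumes the two samples have equal length and are paired elementwise, which is what a sum $x+y$ requires (this is not the same operation as mixing the two bags). The data values and the helper names `mean`, `sample_cov` and `sample_var` are made up for illustration.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sample_cov(a, b):
    """Sample covariance of two equal-length lists (n - 1 denominator)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)


def sample_var(a):
    """Sample variance, i.e. the covariance of a list with itself."""
    return sample_cov(a, a)


# Illustrative paired data (hypothetical values).
x = [1.0, 2.0, 4.0, 7.0]
y = [3.0, 1.0, 5.0, 2.0]
sums = [xi + yi for xi, yi in zip(x, y)]

# Variance of the sum versus the three-term expansion.
lhs = sample_var(sums)
rhs = sample_var(x) + 2 * sample_cov(x, y) + sample_var(y)
print(lhs, rhs)
```

The two printed numbers should agree up to floating-point rounding. The cross term only vanishes in expectation for independent samples; for any particular pair of finite samples it is usually small but nonzero.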

Last edited: Jul 17, 2009
3. Jul 17, 2009

### mXSCNT

Well, let's assume that both samples are from the same distribution. Let z be the combined sample. An estimator of the variance of the distribution based on the combined sample is:
$$\hat{\sigma}^2_z = \frac{1}{n+m-1} \sum_z (z-\hat{z})^2$$
$$= \frac{1}{n+m-1}\left(\sum_x (x-\hat{z})^2 + \sum_y (y-\hat{z})^2\right)$$
Now, this isn't quite in the form we want because of the $\hat{z}$. However, $\hat{z} = (n\hat{x} + m\hat{y})/(n+m)$. So (I'll just do the x's):
$$\sum_x (x-\hat{z})^2 = \sum_x \left(x - \frac{n\hat{x} + m\hat{y}}{n+m}\right)^2$$
$$= \sum_x \left(x - \hat{x} + \hat{x} - \frac{n\hat{x} + m\hat{y}}{n+m}\right)^2$$
$$= \sum_x (x - \hat{x} + c)^2$$ where $c = -(m\hat{y} - m\hat{x})/(n+m)$
$$= \sum_x \left((x - \hat{x})^2 + 2c(x - \hat{x}) + c^2\right)$$
$$= (n-1)\hat{\sigma}^2_x + 0 + nc^2$$
You can do the same thing for y, and combine them.
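To make the combined result concrete, here is a small Python sketch of the derivation above with the y terms filled in by symmetry. The function name `combined_variance`, its argument names, and the sample data are assumptions made for illustration. The variances passed in are taken to be sample variances with an $n-1$ denominator, as in the derivation.

```python
from statistics import mean, variance


def combined_variance(n, mean_x, var_x, m, mean_y, var_y):
    """Sample variance of the mixed bag, using only summary statistics.

    var_x and var_y are assumed to be sample variances (n - 1 denominator).
    """
    # Offsets of each sample mean from the combined mean z-hat.
    c_x = m * (mean_x - mean_y) / (n + m)
    c_y = n * (mean_y - mean_x) / (n + m)
    # Sum of squared deviations about z-hat, one term per sample.
    ss_x = (n - 1) * var_x + n * c_x ** 2
    ss_y = (m - 1) * var_y + m * c_y ** 2
    return (ss_x + ss_y) / (n + m - 1)


# Illustrative data (hypothetical values) to compare against a direct calculation.
x = [2.0, 4.0, 4.0, 5.0]
y = [7.0, 9.0, 10.0]

from_summary = combined_variance(len(x), mean(x), variance(x),
                                 len(y), mean(y), variance(y))
direct = variance(x + y)  # x + y concatenates the two lists
print(from_summary, direct)
```

The two printed values should match up to rounding, since the formula is an exact algebraic rearrangement rather than an approximation.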

Last edited: Jul 17, 2009
4. Jul 17, 2009

### statdad

Do you mean that you are pooling the two samples, one with sample size $m$, the other with sample size $n$, and you want to use the information to estimate a common population variance? If so, read on.

Are $\hat \sigma_x, \hat \sigma_y$ estimates of the variances or the standard deviations? Your notation is a little odd.

If you have estimates of variance, and are simply pooling (combining) the two samples, you can use the classical pooled variance (with your notation)

$$\hat \sigma = \frac{(n-1)\hat \sigma_x + (m-1) \hat \sigma_y}{n+m-2}$$

If your estimates are for the standard deviations rather than the variance, the appropriate calculation would be

$$\hat \sigma^2 = \frac{(n-1) \hat \sigma_x^2 + (m-1) \hat \sigma_y^2}{m+n-2}$$

Notice that this is an entirely different question from the one the others answered: they took you to be asking "How can I use the summary information in these two samples to determine the variance I would get by combining all data into one large sample?"
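For reference, the classical pooled variance described in this post can be sketched in Python as below. The function name `pooled_variance` and the example numbers are made up for illustration. The inputs are assumed to be variances, not standard deviations; standard deviations would need to be squared first.

```python
def pooled_variance(n, var_x, m, var_y):
    """Classical pooled variance from two sample variances.

    Each variance is weighted by its degrees of freedom. The sample
    means do not enter the formula at all.
    """
    if n + m <= 2:
        raise ValueError("need more than two observations in total")
    return ((n - 1) * var_x + (m - 1) * var_y) / (n + m - 2)


# Hypothetical summary statistics: 10 samples with variance 4.0,
# 20 samples with variance 1.0.
estimate = pooled_variance(10, 4.0, 20, 1.0)
print(estimate)  # (9 * 4.0 + 19 * 1.0) / 28
```

With equal sample sizes this reduces to the plain average of the two variances.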

5. Jul 17, 2009

### mXSCNT

The difference between the pooled variance (your method) and the equation I gave is that the pooled variance assumes the two samples actually come from different populations, with different population means but the same population variance. The equation I gave assumes that the samples are from the same population, with the same population mean and variance.

6. Jul 17, 2009

### statdad

I realize that: I was unsure of the OP's intent in the question.

7. Jul 18, 2009

### aydos

Thanks for the replies and I am sorry for the confusion in notation, I meant to write $\hat{\sigma}^2_x$ instead of $\hat{\sigma}_x$.

Both replies from mXSCNT and statdad were very useful. I did not realize the solutions are different for different assumptions about the populations. I still don't fully understand why the solutions are different, though.

The two problems I am considering have the following assumptions:

1- the population mean is different and the variance is the same
2- both population mean and variance are different

8. Jul 18, 2009

### mXSCNT

Well, suppose that the x's are {1, 2}, and the y's are {101, 102}.

If x and y were taken from populations with different means, then you'd expect the variance of those populations to be quite small (since x and y each have variances of only 1/2). On the other hand if x and y were taken from populations with the same mean, then you'd expect the variance of that population to be comparatively large, since x and y are separated by about 100 units.
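Running the numbers for this example makes the gap explicit. The sketch below is only an illustration: it computes the pooled estimate (same variance, possibly different means) and the variance of the mixed bag (one common population) for the values {1, 2} and {101, 102}. The variable names are arbitrary.

```python
from statistics import variance

x = [1.0, 2.0]
y = [101.0, 102.0]
n, m = len(x), len(y)

# Pooled estimate: ignores the distance between the two sample means.
pooled = ((n - 1) * variance(x) + (m - 1) * variance(y)) / (n + m - 2)

# Variance of the mixed bag: the gap between the means dominates.
combined = variance(x + y)  # x + y concatenates the two lists

print(pooled)    # 0.5
print(combined)  # about 3333.67, i.e. 10001 / 3
```

The pooled value stays at 1/2 however far apart the two bags are, while the mixed-bag variance grows with the square of the distance between the means.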