Variance of the union of two samples

aydos · Jul 17, 2009

This seems like a simple problem but I cannot find an answer. Imagine I have two bags of samples with estimated means \hat{x} and \hat{y} and estimated variances \hat{\sigma_x} and \hat{\sigma_y}. The bags contain n and m samples respectively. Now assume I mix all the samples in a single bag.

Question: How do I estimate the variance of all the samples in this larger bag only using the statistics above?

John Creighto · Jul 17, 2009

For the sum of the two samples the variance is given by:

E[(x+y-\hat{x}-\hat{y})(x+y-\hat{x}-\hat{y})]
=E[(x-\hat{x})^2]+2E[(x-\hat{x})(y-\hat{y})]+E[(y-\hat{y})^2]

The standard deviation is the square root of the variance.

If the samples are independent then:

2E[(x-\hat{x})(y-\hat{y})]=0

mXSCNT · Jul 17, 2009

Well, let's assume that both samples are from the same distribution. Let z be the combined sample. An estimator of the variance of the distribution based on the combined sample is:
\hat{\sigma}^2_z = 1/(n+m-1) \sum_z (z-\hat{z})^2
= 1/(n+m-1)(\sum_x (x-\hat{z})^2 + \sum_y (y-\hat{z})^2)
Now, this isn't quite in the form we want because of the \hat{z}. However, \hat{z} = (n\hat{x} + m\hat{y})/(n+m). So (I'll just do the x's):
\sum_x (x-\hat{z})^2 = \sum_x (x - (n\hat{x} + m\hat{y})/(n+m))^2
= \sum_x (x - \hat{x} + \hat{x} - (n\hat{x} + m\hat{y})/(n+m))^2
= \sum_x (x - \hat{x} + c)^2 where c = - (m\hat{y} - m\hat{x})/(n+m)
= \sum_x ((x - \hat{x})^2 + 2c(x - \hat{x}) + c^2)
= (n-1)\hat{\sigma}^2_x + 0 + nc^2
The middle term vanishes because \sum_x (x - \hat{x}) = 0. You can do the same thing for y, and combine them.
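As a rough check of the derivation above, here is a minimal Python sketch that combines the x and y parts into one formula and compares it with the variance computed directly from the merged data. The function name `combined_variance` and the sample values are made up for illustration. The y part is obtained by repeating the same steps with the roles of x and y swapped, so its offset is n(\hat{y} - \hat{x})/(n+m).

```python
import statistics

def combined_variance(n, mean_x, var_x, m, mean_y, var_y):
    """Sample variance (denominator n+m-1) of the merged bag,
    using only the sizes, sample means and sample variances."""
    total = n + m
    # Offsets of each sample mean from the combined mean \hat{z}
    c_x = m * (mean_x - mean_y) / total  # \hat{x} - \hat{z}
    c_y = n * (mean_y - mean_x) / total  # \hat{y} - \hat{z}
    # Sum of squared deviations about \hat{z}, one part per bag
    ss_x = (n - 1) * var_x + n * c_x ** 2
    ss_y = (m - 1) * var_y + m * c_y ** 2
    return (ss_x + ss_y) / (total - 1)

# Illustrative data only (not from the thread)
xs = [2.0, 4.0, 4.0, 5.0]
ys = [1.0, 7.0, 9.0]

from_stats = combined_variance(
    len(xs), statistics.mean(xs), statistics.variance(xs),
    len(ys), statistics.mean(ys), statistics.variance(ys),
)
direct = statistics.variance(xs + ys)

# The two numbers should agree up to floating-point rounding
print(from_stats, direct)
```

Here `statistics.variance` uses the n-1 denominator, which matches the estimator written above.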

statdad · Jul 17, 2009

Do you mean that you are pooling the two samples, one with sample size m, the other with sample size n, and you want to use the information to estimate a common population variance? If so, read on.

Are \hat \sigma_x, \hat \sigma_y estimates of the variances or the standard deviations? Your notation is a little odd.

If you have estimates of variance, and are simply pooling (combining) the two samples, you can use the classical pooled variance (with your notation)

\hat \sigma = \frac{(n-1)\hat \sigma_x + (m-1) \hat \sigma_y}{n+m-2}

If your estimates are for the standard deviations rather than the variance, the appropriate calculation would be

\hat \sigma^2 = \frac{(n-1) \hat \sigma_x^2 + (m-1) \hat \sigma_y^2}{m+n-2}
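For anyone who wants to try the pooled formula numerically, a minimal Python sketch follows. The function name `pooled_variance` and the input values are hypothetical; the inputs are assumed to be sample variances (not standard deviations) computed with the usual n-1 denominator.

```python
def pooled_variance(n, var_x, m, var_y):
    """Classical pooled variance from two sample variances.

    Assumes both populations share one variance; their means may differ.
    """
    if n + m <= 2:
        raise ValueError("need more than two observations in total")
    return ((n - 1) * var_x + (m - 1) * var_y) / (n + m - 2)

# Hypothetical summary statistics: n = 10 with variance 4, m = 20 with variance 9
print(pooled_variance(10, 4.0, 20, 9.0))
```

If only standard deviations are available, square them before passing them in.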

Notice that this is an entirely different question from the one the others answered: they took you to be asking "How can I use the summary information in these two samples to determine the variance I would get by combining all data into one large sample?"

mXSCNT · Jul 17, 2009

statdad said:

Do you mean that you are pooling the two samples, one with sample size m, the other with sample size n, and you want to use the information to estimate a common population variance? If so, read on.

Notice that this is an entirely different question from the one the others answered: they took you to be asking "How can I use the summary information in these two samples to determine the variance I would get by combining all data into one large sample?"

The difference between the pooled variance (your method) and the equation I gave is that the pooled variance assumes the two samples actually come from different populations, with different population means but the same population variance. The equation I gave assumes that the samples are from the same population, with the same population mean and variance.

statdad · Jul 17, 2009

I realize that: I was unsure of the OP's intent in the question.

aydos · Jul 18, 2009

Thanks for the replies and I am sorry for the confusion in notation, I meant to write \hat{\sigma}^2_x instead of \hat{\sigma}_x.

Both replies from mXSCNT and statdad were very useful. I did not realize the solutions are different for different assumptions about the populations. I still do not fully understand why the solutions are different, though.

The two problems I am considering have the following assumptions:

1- the population means are different and the variances are the same
2- both the population means and the variances are different

mXSCNT · Jul 18, 2009

Well, suppose that the x's are {1, 2}, and the y's are {101, 102}.

If x and y were taken from populations with different means, then you'd expect the variance of those populations to be quite small (since x and y each have variances of only 1/2). On the other hand if x and y were taken from populations with the same mean, then you'd expect the variance of that population to be comparatively large, since x and y are separated by about 100 units.
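To put numbers on this example, here is a small Python sketch that works from the raw values rather than from summary statistics. The variable names are chosen for illustration only. It computes the pooled variance (different means, common variance) and the variance of the merged bag (one common population).

```python
import statistics

xs = [1.0, 2.0]
ys = [101.0, 102.0]
n, m = len(xs), len(ys)

# Sample variances of each bag (n-1 denominator)
var_x = statistics.variance(xs)
var_y = statistics.variance(ys)

# Pooled variance: each bag is measured about its own mean
pooled = ((n - 1) * var_x + (m - 1) * var_y) / (n + m - 2)

# Merged-bag variance: every value is measured about the combined mean
merged = statistics.variance(xs + ys)

print(pooled)  # 0.5
print(merged)  # about 3333.67, i.e. 10001/3
```

The gap between the two results comes almost entirely from the roughly 100-unit separation between the two sample means, which the pooled estimate ignores and the merged estimate counts as spread.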
