Variance of the union of two samples

aydos · Jul 17, 2009

This seems like a simple problem but I cannot find an answer. Imagine I have 2 bags of samples with estimated means [itex]\hat{x}[/itex] and [itex]\hat{y}[/itex] and estimated variances [itex]\hat{\sigma_x}[/itex] and [itex]\hat{\sigma_y}[/itex]. The bags contain n and m samples respectively. Now assume I mix all the samples in a single bag.

Question: How do I estimate the variance of all the samples in this larger bag only using the statistics above?

John Creighto · Jul 17, 2009

For the sum of the two samples the variance is given by:

[tex]E[(x+y-\hat{x}-\hat{y})(x+y-\hat{x}-\hat{y})][/tex]
[tex]=E[(x-\hat{x})^2]+2E[(x-\hat{x})(y-\hat{y})]+E[(y-\hat{y})^2][/tex]

The standard deviation is the square root of the variance.

If the samples are independent then:

[tex]2E[(x-\hat{x})(y-\hat{y})]=0[/tex]
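A quick numerical check of this expansion may help. Note that it concerns the element-wise sum of paired observations, not the union of two bags. The data and the helper names `mean` and `cov` below are made up for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Sample covariance with an (n - 1) denominator; cov(a, a) is the sample variance."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)

# Hypothetical paired observations (same length is required for a sum).
xs = [1.0, 3.0, 2.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0]
sums = [x + y for x, y in zip(xs, ys)]

var_of_sum = cov(sums, sums)
expanded = cov(xs, xs) + 2 * cov(xs, ys) + cov(ys, ys)
print(var_of_sum, expanded)  # the two values agree up to rounding
```

The cross term only drops out in expectation for independent variables; for a finite sample the sample covariance is usually small but not exactly zero.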

mXSCNT · Jul 17, 2009

Well, let's assume that both samples are from the same distribution. Let z be the combined sample. An estimator of the variance of the distribution based on the combined sample is:
[tex]\hat{\sigma}^2_z = 1/(n+m-1) \sum_z (z-\hat{z})^2[/tex]
[tex]= 1/(n+m-1)(\sum_x (x-\hat{z})^2 + \sum_y (y-\hat{z})^2)[/tex]
Now, this isn't quite in the form we want because of the [tex]\hat{z}[/tex]. However, [tex]\hat{z} = (n\hat{x} + m\hat{y})/(n+m)[/tex]. So (I'll just do the x's):
[tex]\sum_x (x-\hat{z})^2 = \sum_x (x - (n\hat{x} + m\hat{y})/(n+m))^2[/tex]
[tex]= \sum_x (x - \hat{x} + \hat{x} - (n\hat{x} + m\hat{y})/(n+m))^2[/tex]
[tex]= \sum_x (x - \hat{x} + c)^2[/tex] where [tex]c = - (m\hat{y} - m\hat{x})/(n+m)[/tex]
[tex]= \sum_x ((x - \hat{x})^2 + 2c(x - \hat{x}) + c^2)[/tex]
[tex]= (n-1)\hat{\sigma}^2_x + 0 + nc^2[/tex]
You can do the same thing for y, and combine them.
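Carrying the same steps through for y and adding the two sums, the [tex]nc^2[/tex]-type terms collapse to [tex]nm(\hat{x}-\hat{y})^2/(n+m)[/tex]. Here is a minimal Python sketch of the resulting formula, checked against a direct computation; the function name `combined_variance` and the sample data are assumptions for illustration only.

```python
import statistics

def combined_variance(n, mean_x, var_x, m, mean_y, var_y):
    """Sample variance (denominator n + m - 1) of the merged bag,
    computed from the per-bag sizes, means and sample variances only."""
    within = (n - 1) * var_x + (m - 1) * var_y
    between = n * m / (n + m) * (mean_x - mean_y) ** 2
    return (within + between) / (n + m - 1)

# Hypothetical data, used only to compare against the direct computation.
xs = [2.0, 4.0, 4.0, 5.0]
ys = [7.0, 9.0, 10.0]

from_summary = combined_variance(
    len(xs), statistics.mean(xs), statistics.variance(xs),
    len(ys), statistics.mean(ys), statistics.variance(ys),
)
direct = statistics.variance(xs + ys)
print(from_summary, direct)  # the two values agree up to rounding
```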

statdad · Jul 17, 2009

Do you mean that you are pooling the two samples, one with sample size [tex]m[/tex], the other with sample size [tex]n[/tex], and you want to use the information to estimate a common population variance? If so, read on.

Are [tex]\hat \sigma_x, \hat \sigma_y[/tex] estimates of the variances or the standard deviations? Your notation is a little odd.

If you have estimates of variance, and are simply pooling (combining) the two samples, you can use the classical pooled variance (with your notation)

[tex] \hat \sigma = \frac{(n-1)\hat \sigma_x + (m-1) \hat \sigma_y}{n+m-2}[/tex]

If your estimates are for the standard deviations rather than the variance, the appropriate calculation would be

[tex] \hat \sigma^2 = \frac{(n-1) \hat \sigma_x^2 + (m-1) \hat \sigma_y^2}{m+n-2}[/tex]

Notice that this is an entirely different question from the one the others answered: they took you to be asking "How can I use the summary information in these two samples to determine the variance I would get by combining all data into one large sample?"
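As a rough illustration of the pooled estimate, assuming the inputs are sample variances with the usual (n - 1) denominators, a small Python sketch follows. The name `pooled_variance` and the data are hypothetical.

```python
import statistics

def pooled_variance(n, var_x, m, var_y):
    """Classical pooled variance: a weighted average of the two sample
    variances, with weights equal to their degrees of freedom."""
    return ((n - 1) * var_x + (m - 1) * var_y) / (n + m - 2)

# Hypothetical samples with clearly different means.
xs = [1.0, 2.0, 3.0]
ys = [11.0, 12.0, 13.0, 14.0]

pooled = pooled_variance(len(xs), statistics.variance(xs),
                         len(ys), statistics.variance(ys))
print(pooled)  # about 1.4, unaffected by the gap between the two means
```

The sample means never enter this formula, which is why it suits the case where the two populations may have different means but share a variance.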

mXSCNT · Jul 17, 2009

statdad said:

Do you mean that you are pooling the two samples, one with sample size m, the other with sample size n, and you want to use the information to estimate a common population variance? If so, read on.

Notice that this is an entirely different question from the one the others answered: they took you to be asking "How can I use the summary information in these two samples to determine the variance I would get by combining all data into one large sample?"

The difference between the pooled variance (your method) and the equation I gave is that the pooled variance assumes the two samples actually come from different populations, with different population means but the same population variance. The equation I gave assumes that the samples are from the same population, with the same population mean and variance.

statdad · Jul 17, 2009

I realize that: I was unsure of the OP's intent in the question.

aydos · Jul 18, 2009

Thanks for the replies and I am sorry for the confusion in notation, I meant to write [itex]\hat{\sigma}^2_x[/itex] instead of [itex]\hat{\sigma}_x[/itex].

Both replies from mXSCNT and statdad were very useful. I did not realize the solutions are different for different assumptions about the populations. I still haven't fully understood why the solutions are different though?

The two problems I am considering have the following assumptions:

1- the population mean is different and the variance is the same
2- both population mean and variance are different

mXSCNT · Jul 18, 2009

Well, suppose that the x's are {1, 2}, and the y's are {101, 102}.

If x and y were taken from populations with different means, then you'd expect the variance of those populations to be quite small (since x and y each have variances of only 1/2). On the other hand if x and y were taken from populations with the same mean, then you'd expect the variance of that population to be comparatively large, since x and y are separated by about 100 units.
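Putting numbers on this example, here is a short Python sketch using only the standard `statistics` module; the variable names are chosen for illustration.

```python
import statistics

xs = [1, 2]
ys = [101, 102]

var_x = statistics.variance(xs)  # 0.5
var_y = statistics.variance(ys)  # 0.5

# Different means, common variance: pool the within-sample variances.
pooled = ((len(xs) - 1) * var_x + (len(ys) - 1) * var_y) / (len(xs) + len(ys) - 2)

# Same mean and variance: treat all four values as one sample.
combined = statistics.variance(xs + ys)

print(pooled)    # 0.5
print(combined)  # about 3333.67
```

The combined figure is dominated by the roughly 100-unit gap between the two groups, while the pooled figure ignores that gap entirely.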
