Variance of the union of two samples

  • Context: Undergrad
  • Thread starter: aydos
  • Tags: Union, Variance

Discussion Overview

The discussion revolves around estimating the variance of a combined sample from two separate bags of samples, each with their own estimated means and variances. Participants explore different statistical approaches and assumptions regarding the populations from which the samples are drawn, including considerations of independence and the relationship between the samples.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant poses a question about how to estimate the variance of a combined sample using the means and variances of two separate samples.
  • Another participant provides a mathematical expression for the variance of the sum of two independent samples, noting that if the samples are independent, the covariance term becomes zero.
  • A different participant suggests an estimator for the variance of the combined sample, introducing a formula that incorporates the means of the individual samples.
  • Some participants clarify the distinction between pooling samples and estimating a common population variance, raising questions about the notation used for variance and standard deviation.
  • One participant emphasizes that the pooled variance assumes the samples come from different populations with the same variance, while another equation assumes the samples come from the same population.
  • Another participant expresses uncertainty about the original poster's intent and the implications of different assumptions on the solutions provided.
  • A participant illustrates a scenario with specific values to highlight how the variance might differ based on the relationship between the populations from which the samples are drawn.

Areas of Agreement / Disagreement

Participants express differing views on the assumptions regarding the populations from which the samples are drawn, particularly concerning whether the means and variances are the same or different. There is no consensus on a single approach to estimating the variance of the combined sample.

Contextual Notes

Participants note that the solutions vary based on assumptions about the populations, including whether the means and variances are the same or different. The discussion highlights the complexity of combining statistical estimates and the importance of clearly defining assumptions.

aydos
This seems like a simple problem but I cannot find an answer. Imagine I have 2 bags of samples with estimated means \hat{x} and \hat{y} and estimated variances \hat{\sigma_x} and \hat{\sigma_y}. The bags contain n and m samples respectively. Now assume I mix all the samples in a single bag.

Question: How do I estimate the variance of all the samples in this larger bag only using the statistics above?
 
For the sum of the two samples the variance is given by:

E[(x+y-\hat{x}-\hat{y})(x+y-\hat{x}-\hat{y})]
=E[(x-\hat{x})^2]+2E[(x-\hat{x})(y-\hat{y})]+E[(y-\hat{y})^2]

The standard deviation is the square root of the variance.

If the samples are independent then:

2E[(x-\hat{x})(y-\hat{y})]=0
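The expansion above can be checked numerically. The sketch below is only an illustration: the paired values in `xs` and `ys` and the helper `sample_cov` are made up here and are not from the thread. Note that it concerns the element-wise sum x + y of paired observations, which is a different operation from mixing two bags of samples.

```python
from statistics import mean


def sample_cov(a, b):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)


# Hypothetical paired observations, chosen only for illustration.
xs = [1.0, 2.0, 4.0, 7.0]
ys = [3.0, 1.0, 5.0, 2.0]
sums = [u + v for u, v in zip(xs, ys)]

# Variance of the sum, computed directly (cov of a sequence with itself).
var_of_sum = sample_cov(sums, sums)

# Variance of the sum via the expansion: Var(x) + 2 Cov(x, y) + Var(y).
expanded = sample_cov(xs, xs) + 2 * sample_cov(xs, ys) + sample_cov(ys, ys)

print(var_of_sum, expanded)
```

The two printed numbers agree up to floating-point rounding. For this particular data the covariance term is not zero; it only drops out in expectation when x and y are independent.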
 
Well, let's assume that both samples are from the same distribution. Let z be the combined sample. An estimator of the variance of the distribution based on the combined sample is:
\hat{\sigma}^2_z = \frac{1}{n+m-1} \sum_z (z-\hat{z})^2
= \frac{1}{n+m-1} \left( \sum_x (x-\hat{z})^2 + \sum_y (y-\hat{z})^2 \right)
Now, this isn't quite in the form we want because of the \hat{z}. However, \hat{z} = \frac{n\hat{x} + m\hat{y}}{n+m}. So (I'll just do the x's):
\sum_x (x-\hat{z})^2 = \sum_x \left( x - \frac{n\hat{x} + m\hat{y}}{n+m} \right)^2
= \sum_x \left( x - \hat{x} + \hat{x} - \frac{n\hat{x} + m\hat{y}}{n+m} \right)^2
= \sum_x (x - \hat{x} + c)^2, where c = \hat{x} - \hat{z} = -\frac{m\hat{y} - m\hat{x}}{n+m}
= \sum_x \left( (x - \hat{x})^2 + 2c(x - \hat{x}) + c^2 \right)
= (n-1)\hat{\sigma}^2_x + 0 + nc^2 (the middle term vanishes because \sum_x (x - \hat{x}) = 0)
You can do the same thing for y, and combine them.
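Carrying the same steps through for the y's and combining gives a closed form. This is a sketch of the completion the post leaves to the reader; the symbol d is introduced here for illustration and does not appear in the thread.

```latex
\sum_y (y-\hat{z})^2 = (m-1)\hat{\sigma}^2_y + m d^2,
\qquad d = \hat{y} - \hat{z} = \frac{n(\hat{y} - \hat{x})}{n+m}

n c^2 + m d^2 = \frac{n m^2 + m n^2}{(n+m)^2} (\hat{x} - \hat{y})^2
             = \frac{nm}{n+m} (\hat{x} - \hat{y})^2

\hat{\sigma}^2_z = \frac{(n-1)\hat{\sigma}^2_x + (m-1)\hat{\sigma}^2_y
                   + \frac{nm}{n+m} (\hat{x} - \hat{y})^2}{n+m-1}
```

The last line depends only on n, m, the two sample means, and the two sample variances, which is what the original question asked for.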
 
Do you mean that you are pooling the two samples, one with sample size m, the other with sample size n, and you want to use the information to estimate a common population variance? If so, read on.

Are \hat \sigma_x, \hat \sigma_y estimates of the variances or the standard deviations? Your notation is a little odd.

If you have estimates of variance, and are simply pooling (combining) the two samples, you can use the classical pooled variance (with your notation)

\hat \sigma = \frac{(n-1) \hat \sigma_x + (m-1) \hat \sigma_y}{n+m-2}

If your estimates are for the standard deviations rather than the variance, the appropriate calculation would be

\hat \sigma^2 = \frac{(n-1) \hat \sigma_x^2 + (m-1) \hat \sigma_y^2}{n+m-2}
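As a minimal sketch of the pooled-variance calculation, assuming the inputs are sample variances computed with the usual n - 1 and m - 1 denominators: the function name `pooled_variance` and the numbers in the usage line are hypothetical and chosen only for illustration.

```python
def pooled_variance(var_x, n, var_y, m):
    """Classical pooled variance from two sample variances.

    var_x, var_y: sample variances (n - 1 and m - 1 denominators).
    n, m: the two sample sizes.
    """
    if n + m <= 2:
        raise ValueError("need more than two observations in total")
    return ((n - 1) * var_x + (m - 1) * var_y) / (n + m - 2)


# Hypothetical summary statistics: variances 4.0 and 9.0, sizes 10 and 20.
print(pooled_variance(4.0, 10, 9.0, 20))
```

The result is a weighted average of the two variances, with weights given by the degrees of freedom. The sample means do not enter the calculation at all.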

Notice that this is an entirely different question from the one the others answered: they took you to be asking "How can I use the summary information in these two samples to determine the variance I would get by combining all data into one large sample?"
 
statdad said:
Do you mean that you are pooling the two samples, one with sample size m, the other with sample size n, and you want to use the information to estimate a common population variance? If so, read on.

Notice that this is an entirely different question from the one the others answered: they took you to be asking "How can I use the summary information in these two samples to determine the variance I would get by combining all data into one large sample?"

The difference between the pooled variance (your method) and the equation I gave is that the pooled variance assumes the two samples actually come from different populations, with different population means but the same population variance. The equation I gave assumes that the samples are from the same population, with the same population mean and variance.
 
I realize that: I was unsure of the OP's intent in the question.
 
Thanks for the replies and I am sorry for the confusion in notation, I meant to write \hat{\sigma}^2_x instead of \hat{\sigma}_x.

Both replies from mXSCNT and statdad were very useful. I did not realize the solutions are different for different assumptions about the populations. I still don't fully understand why the solutions are different, though.

The two problems I am considering have the following assumptions:

1- the population means are different and the variances are the same
2- both the population means and the variances are different
 
Well, suppose that the x's are {1, 2}, and the y's are {101, 102}.

If x and y were taken from populations with different means, then you'd expect the variance of those populations to be quite small (since x and y each have variances of only 1/2). On the other hand if x and y were taken from populations with the same mean, then you'd expect the variance of that population to be comparatively large, since x and y are separated by about 100 units.
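The example can be worked through numerically. The sketch below computes both estimators from the summary statistics of {1, 2} and {101, 102} and compares the same-population estimate with the variance of the mixed data. The function names are hypothetical, and the combined formula is the one obtained by finishing the derivation given earlier in the thread.

```python
from statistics import mean, variance


def combined_variance(mean_x, var_x, n, mean_y, var_y, m):
    """Sample variance of the mixed bag, from summary statistics only."""
    between = n * m / (n + m) * (mean_x - mean_y) ** 2
    return ((n - 1) * var_x + (m - 1) * var_y + between) / (n + m - 1)


def pooled_variance(var_x, n, var_y, m):
    """Pooled variance: different means allowed, common variance assumed."""
    return ((n - 1) * var_x + (m - 1) * var_y) / (n + m - 2)


xs = [1, 2]
ys = [101, 102]
n, m = len(xs), len(ys)

same_population = combined_variance(
    mean(xs), variance(xs), n, mean(ys), variance(ys), m
)
different_means = pooled_variance(variance(xs), n, variance(ys), m)
direct = variance(xs + ys)  # variance of the mixed bag, from the raw data

print(different_means)  # 0.5
print(same_population)  # about 3333.67, matching `direct` up to rounding
```

The pooled estimate ignores the gap of about 100 between the two sample means, while the same-population estimate is dominated by it.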
 
