Variance of the union of two samples

In summary: The thread discusses estimating the variance of a larger bag that mixes the samples from two smaller bags, given only each bag's size, estimated mean, and estimated variance. The answer depends on the assumption made about the populations. The pooled variance formula applies when the two populations have different means but the same variance. A different formula, the variance of the combined sample, applies when both samples come from a single population with one mean and one variance. In the thread's example, x = {1, 2} and y = {101, 102}, the pooled variance is small, while the other variance estimate is large.
  • #1
aydos
This seems like a simple problem but I cannot find an answer. Imagine I have 2 bags of samples with estimated means [itex] \hat{x} [/itex] and [itex] \hat{y}[/itex] and estimated variances [itex] \hat{\sigma_x} [/itex] and [itex] \hat{\sigma_y}[/itex]. The bags contain n and m samples respectively. Now assume I mix all the samples in a single bag.

Question: How do I estimate the variance of all the samples in this larger bag only using the statistics above?
 
  • #2
For the sum of the two samples, the variance is given by:

[tex]E[(x+y-\hat{x}-\hat{y})(x+y-\hat{x}-\hat{y})][/tex]
[tex]=E[(x-\hat{x})^2]+2E[(x-\hat{x})(y-\hat{y})]+E[(y-\hat{y})^2][/tex]

The standard deviation is the square root of the variance.

If the samples are independent, then:

[tex]2E[(x-\hat{x})(y-\hat{y})]=0[/tex]
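As a rough numerical check of the expansion above, the sketch below verifies that the sample variance of the element-wise sum x + y equals the two sample variances plus twice the sample covariance. The data values and the helper `sample_covariance` are made up for illustration. Note that this concerns the sum of paired values, which is not the same operation as mixing two bags of samples.

```python
from statistics import mean, variance


def sample_covariance(xs, ys):
    """Sample covariance of paired observations (n - 1 denominator)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


# Hypothetical paired data, chosen only for illustration.
xs = [1.0, 2.0, 4.0, 7.0]
ys = [3.0, 1.0, 5.0, 2.0]

# Variance of the element-wise sum x + y.
sums = [x + y for x, y in zip(xs, ys)]
lhs = variance(sums)

# Var(x) + Var(y) + 2 Cov(x, y), all with the same n - 1 denominator.
rhs = variance(xs) + variance(ys) + 2 * sample_covariance(xs, ys)

print(lhs, rhs)
```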
 
  • #3
Well, let's assume that both samples are from the same distribution. Let z be the combined sample. An estimator of the variance of the distribution based on the combined sample is:
[tex]\hat{\sigma}^2_z = \frac{1}{n+m-1} \sum_z (z-\hat{z})^2[/tex]
[tex]= \frac{1}{n+m-1}\left(\sum_x (x-\hat{z})^2 + \sum_y (y-\hat{z})^2\right)[/tex]
Now, this isn't quite in the form we want because of the [itex]\hat{z}[/itex]. However, [itex]\hat{z} = (n\hat{x} + m\hat{y})/(n+m)[/itex]. So (I'll just do the x's):
[tex]\sum_x (x-\hat{z})^2 = \sum_x \left(x - \frac{n\hat{x} + m\hat{y}}{n+m}\right)^2[/tex]
[tex] = \sum_x \left(x - \hat{x} + \hat{x} - \frac{n\hat{x} + m\hat{y}}{n+m}\right)^2[/tex]
[tex] = \sum_x (x - \hat{x} + c)^2[/tex] where [itex]c = -(m\hat{y} - m\hat{x})/(n+m)[/itex]
[tex] = \sum_x \left((x - \hat{x})^2 + 2c(x - \hat{x}) + c^2\right)[/tex]
[tex] = (n-1)\hat{\sigma}^2_x + 0 + nc^2[/tex]
The middle term is zero because [itex]\sum_x (x - \hat{x}) = 0[/itex].
You can do the same thing for y, and combine them.
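The derivation above can be sketched in code as follows. This is a minimal illustration, not a reference implementation: the function name `combined_variance` and its argument names are hypothetical, and the variances passed in are assumed to be sample variances with n - 1 and m - 1 denominators.

```python
from statistics import variance


def combined_variance(n, mean_x, var_x, m, mean_y, var_y):
    """Sample variance of the mixed bag, from summary statistics only.

    Assumes var_x and var_y are sample variances (n - 1 and m - 1
    denominators), as in the derivation above.
    """
    # Mean of the combined sample.
    mean_z = (n * mean_x + m * mean_y) / (n + m)
    # Sum of squared deviations about mean_z, one term per bag:
    # (n - 1) * var_x + n * c^2, with c = mean_x - mean_z.
    ss_x = (n - 1) * var_x + n * (mean_x - mean_z) ** 2
    ss_y = (m - 1) * var_y + m * (mean_y - mean_z) ** 2
    return (ss_x + ss_y) / (n + m - 1)


# Check against the variance computed directly from the raw data.
xs = [1.0, 2.0]
ys = [101.0, 102.0]
from_summary = combined_variance(2, 1.5, 0.5, 2, 101.5, 0.5)
from_raw = variance(xs + ys)
print(from_summary, from_raw)
```

Working from summary statistics gives the same number as recomputing from the raw data, up to floating-point rounding, so the raw samples are not needed once n, m, the means, and the variances are known.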
 
  • #4
Do you mean that you are pooling the two samples, one with sample size [itex] m [/itex], the other with sample size [itex] n [/itex], and you want to use the information to estimate a common population variance? If so, read on.

Are [itex] \hat \sigma_x, \hat \sigma_y [/itex] estimates of the variances or the standard deviations? Your notation is a little odd.

If you have estimates of variance, and are simply pooling (combining) the two samples, you can use the classical pooled variance (with your notation)

[tex]
\hat \sigma = \frac{(n-1)\hat \sigma_x + (m-1) \hat \sigma_y}{n+m-2}
[/tex]

If your estimates are for the standard deviations rather than the variance, the appropriate calculation would be

[tex]
\hat \sigma^2 = \frac{(n-1) \hat \sigma_x^2 + (m-1) \hat \sigma_y^2}{m+n-2}
[/tex]
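The pooled variance can be sketched as a small function. The name `pooled_variance` and its arguments are hypothetical, and the inputs are assumed to be sample variances (not standard deviations). If you have standard deviations, square them first.

```python
def pooled_variance(n, var_x, m, var_y):
    """Classical pooled variance of two samples of sizes n and m.

    Assumes var_x and var_y are sample variances and that the two
    populations share a common variance.
    """
    return ((n - 1) * var_x + (m - 1) * var_y) / (n + m - 2)


# Two bags of two samples each, both with sample variance 1/2.
equal_case = pooled_variance(2, 0.5, 2, 0.5)

# Unequal sizes weight each variance by its degrees of freedom.
unequal_case = pooled_variance(3, 2.0, 5, 4.0)
print(equal_case, unequal_case)
```

Notice that the sample means do not appear anywhere in this estimate, which is why it ignores how far apart the two bags are.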

Notice that this is an entirely different question from the one the others answered: they took you to be asking "How can I use the summary information in these two samples to determine the variance I would get by combining all data into one large sample?"
 
  • #5
statdad said:
Do you mean that you are pooling the two samples, one with sample size m, the other with sample size n, and you want to use the information to estimate a common population variance? If so, read on.

Notice that this is an entirely different question from the one the others answered: they took you to be asking "How can I use the summary information in these two samples to determine the variance I would get by combining all data into one large sample?"

The difference between the pooled variance (your method) and the equation I gave is that the pooled variance assumes the two samples actually come from different populations, with different population means but the same population variance. The equation I gave assumes that the samples are from the same population, with the same population mean and variance.
 
  • #6
I realize that: I was unsure of the OP's intent in the question.
 
  • #7
Thanks for the replies and I am sorry for the confusion in notation, I meant to write [itex] \hat{\sigma}^2_x[/itex] instead of [itex] \hat{\sigma}_x[/itex].

Both replies from mXSCNT and statdad were very useful. I did not realize the solutions are different for different assumptions about the populations. I still haven't fully understood why the solutions are different though?

The two problems I am considering have the following assumptions:

1- the population mean is different and the variance is the same
2- both population mean and variance are different
 
  • #8
Well, suppose that the x's are {1, 2}, and the y's are {101, 102}.

If x and y were taken from populations with different means, then you'd expect the variance of those populations to be quite small (since x and y each have variances of only 1/2). On the other hand if x and y were taken from populations with the same mean, then you'd expect the variance of that population to be comparatively large, since x and y are separated by about 100 units.
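To put numbers on this example, the sketch below computes both estimates from the raw data using Python's standard `statistics` module. The variable names are illustrative only.

```python
from statistics import variance

xs = [1, 2]
ys = [101, 102]
n, m = len(xs), len(ys)

# Different means, same variance: pool the within-bag variances.
pooled = ((n - 1) * variance(xs) + (m - 1) * variance(ys)) / (n + m - 2)

# Same mean, same variance: treat all four values as one sample.
combined = variance(xs + ys)

print(pooled)    # small: only the spread inside each bag counts
print(combined)  # large: the gap of about 100 between bags counts too
```

Here `pooled` is 1/2 and `combined` is 10001/3, roughly 3333.7, so the two assumptions lead to estimates that differ by several orders of magnitude on the same data.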
 

What is the "Variance of the union of two samples"?

The "Variance of the union of two samples" refers to the mathematical measure of the variability or spread of data when combining two samples together. It takes into account the differences between the two samples and how they affect the overall variance.

Why is the "Variance of the union of two samples" important?

The "Variance of the union of two samples" is important because it helps us understand the variability of data when combining two samples. It allows us to see how much the two samples differ from each other and how that affects the overall variance. This information is useful in statistical analyses and making predictions.

How is the "Variance of the union of two samples" calculated?

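When only summary statistics are available, the "Variance of the union of two samples" is calculated from the two sample sizes n and m, the sample means, and the sample variances: [tex]\hat{\sigma}^2_z = \frac{(n-1)\hat{\sigma}^2_x + (m-1)\hat{\sigma}^2_y + \frac{nm}{n+m}(\hat{x}-\hat{y})^2}{n+m-1}[/tex]. The expression Var(X) + Var(Y) + 2Cov(X,Y) is the variance of the sum X + Y, not the variance of the combined sample.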

What is the difference between "Variance of the union of two samples" and "Variance of the intersection of two samples"?

The "Variance of the union of two samples" is the variance of the combined sample, that is, of all observations from both samples taken together. The "Variance of the intersection of two samples" is not a standard quantity: it would be the variance of only those observations that belong to both samples, and two separately drawn samples normally share no observations.

How can the "Variance of the union of two samples" be used in practical applications?

The "Variance of the union of two samples" can be used in practical applications to understand the variability of data in situations where two samples are combined. For example, in market research, it can help assess the variability of consumer preferences when two different products are introduced. In manufacturing, it can help understand the variability of product quality when two different production processes are used.
