Variance of the union of two samples

  1. Jul 17, 2009 #1
    This seems like a simple problem but I cannot find an answer. Imagine I have 2 bags of samples with estimated means [itex] \hat{x} [/itex] and [itex] \hat{y}[/itex] and estimated variances [itex] \hat{\sigma_x} [/itex] and [itex] \hat{\sigma_y}[/itex]. The bags contain n and m samples respectively. Now assume I mix all the samples in a single bag.

    Question: How do I estimate the variance of all the samples in this larger bag only using the statistics above?
     
  3. Jul 17, 2009 #2
    For the sum of the two samples the variance is given by:

    [tex]E[(x+y-\hat{x} -\hat{y} )(x+y-\hat{x} -\hat{y})] [/tex]
    [tex]=E[(x-\hat{x})^2]+2E[(x-\hat{x})(y-\hat{y})]+E[(y-\hat{y})^2][/tex]

    The standard deviation is the square root of the variance.

    If the samples are independent then:

    [tex]2E[(x-\hat{x})(y-\hat{y})]=0[/tex]
     
    Last edited: Jul 17, 2009
  4. Jul 17, 2009 #3
    Well, let's assume that both samples are from the same distribution. Let z be the combined sample. An estimator of the variance of the distribution based on the combined sample is:
    [tex]\hat{\sigma}^2_z = 1/(n+m-1) \sum_z (z-\hat{z})^2[/tex]
    [tex]= 1/(n+m-1)(\sum_x (x-\hat{z})^2 + \sum_y (y-\hat{z})^2)[/tex]
    Now, this isn't quite in the form we want because of the [tex]\hat{z}[/tex]. However, [tex]\hat{z} = (n\hat{x} + m\hat{y})/(n+m)[/tex]. So (I'll just do the x's):
    [tex]\sum_x (x-\hat{z})^2 = \sum_x (x - (n\hat{x} + m\hat{y})/(n+m))^2[/tex]
    [tex] = \sum_x (x - \hat{x} + \hat{x} - (n\hat{x} + m\hat{y})/(n+m))^2[/tex]
    [tex] = \sum_x (x - \hat{x} + c)^2[/tex] where [tex]c = - (m\hat{y} - m\hat{x})/(n+m)[/tex]
    [tex] = \sum_x ((x - \hat{x})^2 + 2c(x - \hat{x}) + c^2)[/tex]
    [tex] = (n-1)\hat{\sigma}^2_x + 0 + nc^2[/tex]
    You can do the same thing for y, and combine them.
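    For reference, carrying the same steps through for y and adding the two sums should give
    [tex]\hat{\sigma}^2_z = \frac{(n-1)\hat{\sigma}^2_x + (m-1)\hat{\sigma}^2_y + \frac{nm}{n+m}(\hat{x}-\hat{y})^2}{n+m-1}[/tex]
    Below is a minimal Python sketch of that result, assuming the inputs are sample variances computed with the n-1 denominator. The function name combined_variance and the test data are illustrative only, not something from the thread.

```python
import statistics


def combined_variance(n, mean_x, var_x, m, mean_y, var_y):
    """Sample variance (denominator n+m-1) of the merged sample,
    computed only from the summary statistics of the two samples.

    Assumes var_x and var_y were computed with denominators n-1 and m-1.
    """
    # Squared deviations of each sample around its own mean.
    within = (n - 1) * var_x + (m - 1) * var_y
    # Extra spread because the two sample means differ from the combined mean.
    between = n * m / (n + m) * (mean_x - mean_y) ** 2
    return (within + between) / (n + m - 1)


# Illustrative data (made up): compare against the variance of the merged list.
xs = [2.0, 4.0, 4.0, 5.0]
ys = [7.0, 9.0, 10.0]
from_summary = combined_variance(
    len(xs), statistics.mean(xs), statistics.variance(xs),
    len(ys), statistics.mean(ys), statistics.variance(ys),
)
direct = statistics.variance(xs + ys)
print(from_summary, direct)
```

    The two printed numbers should agree up to floating-point rounding, since the formula is an algebraic identity and not an approximation.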
     
    Last edited: Jul 17, 2009
  5. Jul 17, 2009 #4

    statdad

    Homework Helper

    Do you mean that you are pooling the two samples, one with sample size [tex] m [/tex] and the other with sample size [tex] n [/tex], and that you want to use the information to estimate a common population variance? If so, read on.

    Are [tex] \hat \sigma_x, \hat \sigma_y [/tex] estimates of the variances or of the standard deviations? Your notation is a little odd.

    If you have estimates of variance, and are simply pooling (combining) the two samples, you can use the classical pooled variance (with your notation)

    [tex]
    \hat \sigma = \frac{(n-1)\hat \sigma_x + (m-1) \hat \sigma_y}{n+m-2}
    [/tex]

    If your estimates are for the standard deviations rather than the variance, the appropriate calculation would be

    [tex]
    \hat \sigma^2 = \frac{(n-1) \hat \sigma_x^2 + (m-1) \hat \sigma_y^2}{m+n-2}
    [/tex]
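    As a rough illustration of the pooled calculation, here is a small Python sketch. It assumes the inputs are variances (not standard deviations) computed with the n-1 and m-1 denominators; the function name pooled_variance is hypothetical.

```python
def pooled_variance(n, var_x, m, var_y):
    """Classical pooled variance from two sample variances.

    Each sample variance is weighted by its degrees of freedom
    (n-1 and m-1). Requires n + m > 2.
    """
    return ((n - 1) * var_x + (m - 1) * var_y) / (n + m - 2)


# If you only have standard deviations, square them first.
sd_x, sd_y = 2.0, 3.0
print(pooled_variance(5, sd_x ** 2, 3, sd_y ** 2))
```

    Note that the sample means never enter this calculation, which is one way to see that it answers a different question from the combined-sample variance.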

    Notice that this is an entirely different question from the one the others answered: they took you to be asking "How can I use the summary information in these two samples to determine the variance I would get by combining all data into one large sample?"
     
  6. Jul 17, 2009 #5
    The difference between the pooled variance (your method) and the equation I gave is that the pooled variance assumes the two samples actually come from different populations, with different population means but the same population variance. The equation I gave assumes that the samples are from the same population, with the same population mean and variance.
     
  7. Jul 17, 2009 #6

    statdad

    Homework Helper

    I realize that: I was unsure of the OP's intent in the question.
     
  8. Jul 18, 2009 #7
    Thanks for the replies and I am sorry for the confusion in notation, I meant to write [itex] \hat{\sigma}^2_x[/itex] instead of [itex] \hat{\sigma}_x[/itex].

    Both replies from mXSCNT and statdad were very useful. I did not realize the solutions are different for different assumptions about the populations. I still don't fully understand why the solutions are different, though.

    The two problems I am considering have the following assumptions:

    1- the population means are different and the variances are the same
    2- both the population means and the variances are different
     
  9. Jul 18, 2009 #8
    Well, suppose that the x's are {1, 2}, and the y's are {101, 102}.

    If x and y were taken from populations with different means, then you'd expect the variance of those populations to be quite small (since x and y each have variances of only 1/2). On the other hand if x and y were taken from populations with the same mean, then you'd expect the variance of that population to be comparatively large, since x and y are separated by about 100 units.
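    To put numbers on this example, here is a short Python sketch using the standard library's statistics module. It assumes sample variances with the n-1 denominator; the variable names are illustrative.

```python
import statistics

xs = [1.0, 2.0]
ys = [101.0, 102.0]
n, m = len(xs), len(ys)

var_x = statistics.variance(xs)  # sample variance of the x's
var_y = statistics.variance(ys)  # sample variance of the y's

# Different population means, common variance: pool the within-sample spread.
pooled = ((n - 1) * var_x + (m - 1) * var_y) / (n + m - 2)

# Same population mean and variance: treat everything as one sample.
combined = statistics.variance(xs + ys)

print(pooled)    # stays small, on the order of the individual variances
print(combined)  # in the thousands, dominated by the gap of about 100 between groups
```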
     