Finding standard deviation of combination of data

AI Thread Summary
The discussion revolves around finding the standard deviation of a combined dataset from two groups, A and B, where the number of data points in each group differs. Participants clarify that the interpretation of "A+B" is crucial, as it could refer to either the sum of random variables or the union of sets. For a valid solution, it's suggested that the groups should come from the same population with the same mean, and the standard deviation can be calculated using the combined variance formula. However, without knowing the specific values or means of the datasets, a numerical answer cannot be determined. The conversation emphasizes the importance of clear definitions and assumptions in statistical calculations.
songoku
Homework Statement
Group A has standard deviation of 10 and group B has standard deviation of 20. If group A has 150 data and group B has 100 data, what is the standard deviation of A + B?
Relevant Equations
##\sigma^2=\frac{1}{n}\left(\Sigma x^2 -\frac{(\Sigma x)^2}{n}\right)##
I tried some working but got nowhere. I just want to ask whether this question is solvable, i.e. whether the answer can be a numerical value. If yes, I want to try a bit more by myself before asking for a hint here.

Thanks
 
To be clear, by A+B, I assume you mean some set of data ##\{ c_i = a_i + b_i | a_i \in A,\ b_i \in B\}##.
In that case, the correlation between the ##a_i##s and associated ##b_i##s must be considered.
The general equation is Var(##X_A+Y_B##) = Var(##X_A##) + Var(##Y_B##) +2 Cov(##X_A,\ Y_B##).
For uncorrelated random variables, ##X_A## and ##Y_B##, this becomes Var(##X_A+Y_B##) = Var(##X_A##) + Var(##Y_B##)
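As a quick numerical check of the identity above, here is a minimal Python sketch. The paired values are made up for illustration and are not part of the problem, and population (divide-by-n) variance and covariance are assumed throughout.

```python
from statistics import fmean, pvariance

# Hypothetical paired data (a_i, b_i); not taken from the problem.
a = [1.0, 4.0, 2.0, 8.0, 5.0]
b = [2.0, 3.0, 7.0, 9.0, 4.0]


def pcov(x, y):
    """Population covariance (divide by n) of two equal-length lists."""
    mx, my = fmean(x), fmean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)


# Element-wise sums c_i = a_i + b_i; this needs a one-to-one pairing.
c = [ai + bi for ai, bi in zip(a, b)]

lhs = pvariance(c)
rhs = pvariance(a) + pvariance(b) + 2 * pcov(a, b)
print(lhs, rhs)  # the two values agree up to rounding
```

Note that the element-wise sum only exists when every ##a_i## has a partner ##b_i##, which is why unequal group sizes are a problem for this reading of A+B.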
 
FactChecker said:
To be clear, by A+B, I assume you mean some set of data ##\{ c_i = a_i + b_i | a_i \in A,\ b_i \in B\}##.
In that case, the correlation between the ##a_i##s and associated ##b_i##s must be considered.
The general equation is Var(##X_A+Y_B##) = Var(##X_A##) + Var(##Y_B##) +2 Cov(##X_A,\ Y_B##).
For uncorrelated random variables, ##X_A## and ##Y_B##, this becomes Var(##X_A+Y_B##) = Var(##X_A##) + Var(##Y_B##)
Ah I see, so basically this question does not really make sense: the number of data points in each group is not the same, so in A + B some data in A would have no matching data in B.

If the question is modified to ask for the standard deviation when the data in A are combined with the data in B (so that there are now 250 data points in total), can we solve it? Actually this is the version I tried and got stuck on (so I thought maybe the information in the question is not enough).

Thanks
 
songoku said:
Ah I see, so basically this question does not really make sense: the number of data points in each group is not the same, so in A + B some data in A would have no matching data in B.

If the question is modified to ask for the standard deviation when the data in A are combined with the data in B (so that there are now 250 data points in total), can we solve it? Actually this is the version I tried and got stuck on (so I thought maybe the information in the question is not enough).

Thanks
It can be solved if we assume that the groups are taken from the same population and have the same mean.
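Under that assumption the calculation can be sketched as follows. With a common mean, the squared deviations of both groups are measured from the same point, so the combined variance is the size-weighted average of the two variances. The variable names are illustrative, and the given values are assumed to be population (divide-by-n) standard deviations.

```python
import math

n_a, sd_a = 150, 10.0  # group A: size and standard deviation
n_b, sd_b = 100, 20.0  # group B: size and standard deviation

# Assumption: both groups share the same mean, and the given values are
# population (divide-by-n) standard deviations.
var_c = (n_a * sd_a**2 + n_b * sd_b**2) / (n_a + n_b)
sd_c = math.sqrt(var_c)

print(var_c)  # 220.0
print(sd_c)   # square root of 220
```

If the two means differ, this weighted average is only a lower bound on the combined variance.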
 
songoku said:
Ah I see, so basically this question does not really make sense: the number of data points in each group is not the same, so in A + B some data in A would have no matching data in B.
The first problem is that the meaning of "A+B" is undefined, or at least not clear to me. Do you mean the sum of random variables, ##X_A##, from A and ##X_B##, from B? In that case, you need to know which of the A samples match up and sum with which of the B samples.
songoku said:
If the question is modified to ask for the standard deviation when the data in A are combined with the data in B (so that there are now 250 data points in total), can we solve it? Actually this is the version I tried and got stuck on (so I thought maybe the information in the question is not enough).
So you are talking about drawing samples of a random variable, X, from the union of A and B, ##A \cup B##. Are the samples drawn randomly uniformly from ##A \cup B##?
In that case, you should be able to use the standard equation for ##\sigma^2## that you gave above. Apply it to the entire 250 elements. Why do you say that it didn't work?
 
To clarify: are these samples from two populations, A and B, or do they describe the whole population of interest?
You could run a test to determine whether the data come from different populations. I believe the Wilcoxon rank-sum test is one such non-parametric test.
 
What about the property ## \sigma_{A+B}^2 = \sigma_A^2 + \sigma_B^2 ## ?
 
Gavran said:
What about the property ## \sigma_{A+B}^2 = \sigma_A^2 + \sigma_B^2 ## ?
The OP defines A and B as sets. So A+B is not the sum of random variables. It is the sum of sets, whatever that means.
If you are talking about the sum of random variables, the formula is ##\sigma_{X+Y}^2 = \sigma_{X}^2 +\sigma_{Y}^2 + 2 cov(X,Y)##. Your "property" is wrong in general and only right for uncorrelated variables.
On the other hand, if you are talking about the union of sets, ##C=A\cup B##, with a random variable, ##X##, drawn with uniform distribution from ##C##, then it is still wrong. Consider the single-element sets ##A=\{0\}, B=\{100\}##. Clearly, ##\sigma_A = \sigma_B = 0## but ##\sigma_C = 50##.
 
FactChecker said:
The first problem is that the meaning of "A+B" is undefined, or at least not clear to me. Do you mean the sum of random variables, ##X_A##, from A and ##X_B##, from B? In that case, you need to know which of the A samples match up and sum with which of the B samples.

So you are talking about drawing samples of a random variable, X, from the union of A and B, ##A \cup B##. Are the samples drawn randomly uniformly from ##A \cup B##?
In that case, you should be able to use the standard equation for ##\sigma^2## that you gave above. Apply it to the entire 250 elements. Why do you say that it didn't work?
I am not really sure how to interpret the question. I posted the exact question, word for word.

In my opinion, it makes more sense if the interpretation is not the sum of random variables but maybe the sum of sets. Group A has 150 data points with a standard deviation of 10, and group B has 100 data points with a standard deviation of 20. Let's say I combine all the data into one set, C, so this set contains 250 data points and I want to find the standard deviation of C.

This is what I did:
For group A:
$$\sigma_{a}^{2}=\frac{1}{n_a} \left(\Sigma a^2 - \frac{(\Sigma a)^2}{n_a}\right)$$
$$100=\frac{1}{150} \left(\Sigma a^2 - \frac{(\Sigma a)^2}{150}\right)$$
$$\Sigma a^2=15000+\frac{(\Sigma a)^2}{150} \tag{1}$$

For group B:
$$\sigma_{b}^{2}=\frac{1}{n_b} \left(\Sigma b^2 - \frac{(\Sigma b)^2}{n_b}\right)$$
$$400=\frac{1}{100} \left(\Sigma b^2 - \frac{(\Sigma b)^2}{100}\right)$$
$$\Sigma b^2=40000+\frac{(\Sigma b)^2}{100} \tag{2}$$

For group C:
$$\sigma_{c}^{2}=\frac{1}{n_c} \left(\Sigma c^2 - \frac{(\Sigma c)^2}{n_c}\right)$$
$$=\frac{1}{250} \left(\Sigma a^2 +\Sigma b^2 - \frac{(\Sigma a+\Sigma b)^2}{250}\right)$$
$$=\frac{1}{250}\left(15000+\frac{(\Sigma a)^2}{150} + 40000+\frac{(\Sigma b)^2}{100} - \frac{(\Sigma a+\Sigma b)^2}{250}\right)$$

Then I got stuck.

Thanks
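
The last expression for group C can be taken one step further. As a sketch, write ##\mu_a = \Sigma a / 150## and ##\mu_b = \Sigma b / 100## for the two group means (symbols introduced here for illustration; the problem does not give them). The three terms that still contain sums then collapse to a single square:
$$\frac{(\Sigma a)^2}{150}+\frac{(\Sigma b)^2}{100}-\frac{(\Sigma a+\Sigma b)^2}{250}=150\mu_a^2+100\mu_b^2-\frac{(150\mu_a+100\mu_b)^2}{250}=60(\mu_a-\mu_b)^2$$
$$\sigma_{c}^{2}=\frac{1}{250}\left(15000+40000+60(\mu_a-\mu_b)^2\right)=220+0.24(\mu_a-\mu_b)^2$$
So the combined variance depends on the difference between the group means, and it equals 220 only when the two means coincide.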
 
  • #10
Have you tried the approach of post #4, or do you find it unreasonable?
 
  • #11
Hill said:
Have you tried the approach of post #4, or do you find it unreasonable?
Oh, I did that and got ##\sqrt{220}## as the answer. I thought FactChecker was talking about something else, not using the assumption in post #4.

Thanks
 
  • #12
songoku said:
I am not really sure how to interpret the question. I posted the exact question, word for word.

In my opinion, it makes more sense if the interpretation is not the sum of random variables but maybe the sum of sets. Group A has 150 data points with a standard deviation of 10, and group B has 100 data points with a standard deviation of 20. Let's say I combine all the data into one set, C, so this set contains 250 data points and I want to find the standard deviation of C.


For group C:
$$\sigma_{c}^{2}=\frac{1}{n_c} \left(\Sigma c^2 - \frac{(\Sigma c)^2}{n_c}\right)$$
$$=\frac{1}{250} \left(\Sigma a^2 +\Sigma b^2 - \frac{(\Sigma a+\Sigma b)^2}{250}\right)$$
$$=\frac{1}{250}\left(15000+\frac{(\Sigma a)^2}{150} + 40000+\frac{(\Sigma b)^2}{100} - \frac{(\Sigma a+\Sigma b)^2}{250}\right)$$

Then I got stuck.
You are not stuck. You are done.
I have not checked your arithmetic for group C, but that is the correct approach (given certain assumptions about what your question means).
If you want to consider the combined set C as the entire population of possible values of a random variable drawn uniformly from C, then you have calculated the variance of that random variable.
If you want to consider the combined set C as the set of sample results, then you should make one change to your equation. It should be ##\sigma_{c}^{2}=\frac{1}{n_c -1} \left(\Sigma c^2 - \frac{(\Sigma c)^2}{n_c}\right)##. The divisor is reduced by 1 because the population mean is being estimated.
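
The two divisors can be compared directly in a short Python sketch. The data set below is hypothetical (the thread never has the actual values), and the standard-library functions `pvariance` and `variance` are used only as a cross-check on the sum-of-squares formula.

```python
from statistics import pvariance, variance

# Hypothetical combined data set C; any list of numbers works here.
c = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(c)

# Shared numerator: sum of squares minus (sum)^2 / n.
ss = sum(x * x for x in c) - sum(c) ** 2 / n

pop_var = ss / n            # C treated as the entire population
sample_var = ss / (n - 1)   # C treated as a sample from a larger population

print(pop_var, pvariance(c))     # both use the divisor n
print(sample_var, variance(c))   # both use the divisor n - 1
```

For 250 data points the two results differ by a factor of 250/249, so the choice matters little numerically here, but it should still match the intended interpretation.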


PS. When you combine two sets into one, IMO, you should use the union symbol, ##A \cup B##, rather than a plus sign.
 
  • #13
Oh OK, so that means I can't get a numerical answer.

Thank you very much for the help and explanation, FactChecker, Hill, WWGD, and Gavran.
 
  • #14
songoku said:
Oh OK, so that means I can't get a numerical answer.

Thank you very much for the help and explanation, FactChecker, Hill, WWGD, and Gavran.
Oh, wait! I thought that you had the values of the summations of all the elements in ##A \cup B##. Don't you have that? How did you get the means of A and B?
 
  • #15
FactChecker said:
Oh, wait! I thought that you had the values of the summations of all the elements in ##A \cup B##. Don't you have that? How did you get the means of A and B?
I posted the whole question in the OP; that's everything. I don't know the values of the summations of all the elements in ##A \cup B##, and I don't have the means of A and B.
 
  • #16
songoku said:
I posted the whole question in the OP; that's everything. I don't know the values of the summations of all the elements in ##A \cup B##, and I don't have the means of A and B.
Sorry, I misunderstood.

Interpreting A+B as ##A \cup B##:
There is no way to solve it. Consider three simpler problems, all with the same individual 0 (or undefined, if you wish) standard deviations for ##A## and ##B## but significantly different standard deviations for ##A \cup B##:
1) A={0}, B={1}. ##\sigma_{\text{sample},\,A\cup B} = 0.70710678## and ##\sigma_{\text{population},\,A\cup B} = 0.5##
2) A={0}, B={10}. ##\sigma_{\text{sample},\,A\cup B} = 7.0710678## and ##\sigma_{\text{population},\,A\cup B} = 5##
3) A={0}, B={100}. ##\sigma_{\text{sample},\,A\cup B} = 70.710678## and ##\sigma_{\text{population},\,A\cup B} = 50##

If you don't like the 0 or undefined standard deviations for single-element sets A and B, you can easily make multiple-element examples.
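
One possible multiple-element example, as a Python sketch: the groups below are constructed (not taken from the problem) so that A always has 150 values with population standard deviation 10 and B always has 100 values with population standard deviation 20, while only the mean of B changes. The helper `two_point_group` is a hypothetical name used for illustration.

```python
from statistics import pstdev


def two_point_group(mean, sd, n):
    """Return n values (n even), half at mean - sd and half at mean + sd,
    so the population standard deviation is exactly sd."""
    half = n // 2
    return [mean - sd] * half + [mean + sd] * half


a = two_point_group(mean=0.0, sd=10.0, n=150)

# Same sizes and same standard deviations every time; only B's mean moves.
for mean_b in (0.0, 10.0, 50.0):
    b = two_point_group(mean=mean_b, sd=20.0, n=100)
    print(mean_b, pstdev(a), pstdev(b), pstdev(a + b))
```

The first two printed standard deviations stay at 10 and 20 in every row, while the standard deviation of the combined 250 values changes with the gap between the means.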

Interpreting A+B as ##\{a+b| a\in A, b\in B, \text {selected independently and randomly}\}##:
Then apply ##\sigma_{A+B}^2 = \sigma_A^2 + \sigma_B^2## as @Gavran stated in post #7.
Since this is the only interpretation of A+B with a solution, it is probably the correct interpretation.
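
A small Python sketch of this interpretation, under the assumption that "selected independently and randomly" means every pair (a, b) is equally likely, so the sums can be enumerated over all pairs. The two small groups are illustrative values, not data from the problem.

```python
import math
from itertools import product
from statistics import pvariance

# Hypothetical small groups of different sizes; values are illustrative only.
a = [1.0, 3.0, 8.0]
b = [2.0, 5.0, 5.0, 10.0]

# Independent, uniform selection from each group makes every pair (x, y)
# equally likely, so enumerate all pairs.
sums = [x + y for x, y in product(a, b)]

print(pvariance(sums), pvariance(a) + pvariance(b))  # equal up to rounding

# With the standard deviations given in the problem:
sd_sum = math.sqrt(10.0**2 + 20.0**2)
print(sd_sum)  # square root of 500, about 22.36
```

Under this reading the group sizes 150 and 100 play no role in the result, since only the two variances enter the formula.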
 
  • #17
I understand.

Thank you very much FactChecker
 