Finding standard deviation of combination of data

SUMMARY

The discussion centers on finding the standard deviation of "A + B" for two groups: A (150 data points, standard deviation 10) and B (100 data points, standard deviation 20). The participants point out that "A + B" is ambiguous. If it means the union A ∪ B, a combined dataset C of 250 elements, then the standard deviation of C depends on the means (or sums) of A and B, which are not given. No numerical answer can be derived in that case unless the two groups are assumed to have the same mean, which gives √220. If it instead means the sum of random variables drawn from A and B, the relevant formula is Var(X_A + Y_B) = Var(X_A) + Var(Y_B) + 2Cov(X_A, Y_B), which reduces to Var(X_A) + Var(Y_B) for uncorrelated variables. The conversation concludes that this independent-sum interpretation is probably the intended one, since it is the only one that has a solution with the information given.

PREREQUISITES
  • Understanding of variance and standard deviation concepts
  • Familiarity with statistical notation and formulas, particularly for variance
  • Knowledge of covariance and its role in the variance of a sum of random variables
  • Ability to interpret set operations in statistics, specifically union and sum of sets
NEXT STEPS
  • Study the properties of variance and covariance in statistics
  • Learn about the Wilcoxon rank-sum test for comparing populations
  • Explore the implications of combining datasets with different sizes and distributions
  • Practice calculating standard deviation for combined datasets using real data examples
USEFUL FOR

Statisticians, data analysts, and anyone involved in statistical modeling or data analysis who needs to understand how to combine datasets and calculate their standard deviation accurately.

songoku
Homework Statement
Group A has standard deviation of 10 and group B has standard deviation of 20. If group A has 150 data and group B has 100 data, what is the standard deviation of A + B?
Relevant Equations
##\sigma^2=\frac{1}{n}\left(\Sigma x^2 -\frac{(\Sigma x)^2}{n}\right)##
I tried some workings, but they got me nowhere. I just want to ask whether this question is solvable, i.e. whether the answer can be a numerical value. If yes, I want to try a bit more by myself before asking for a hint here.

Thanks
 
To be clear, by A+B, I assume you mean some set of data ##\{ c_i = a_i + b_i | a_i \in A,\ b_i \in B\}##.
In that case, the correlation between the ##a_i##s and associated ##b_i##s must be considered.
The general equation is Var(##X_A+Y_B##) = Var(##X_A##) + Var(##Y_B##) +2 Cov(##X_A,\ Y_B##).
For uncorrelated random variables, ##X_A## and ##Y_B##, this becomes Var(##X_A+Y_B##) = Var(##X_A##) + Var(##Y_B##)
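
A minimal numerical check of the general equation, assuming the data really are paired so that each a_i is matched with one b_i. The lists `a` and `b` and the helper `population_covariance` are hypothetical, chosen only for illustration; population (divide by n) variance and covariance are used throughout.

```python
from statistics import mean, pvariance


def population_covariance(xs, ys):
    """Population covariance of two equal-length, paired lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical paired data: a[i] is matched with b[i].
a = [1, 2, 3, 4]
b = [2, 1, 4, 3]
c = [x + y for x, y in zip(a, b)]  # element-wise sums c_i = a_i + b_i

lhs = pvariance(c)
rhs = pvariance(a) + pvariance(b) + 2 * population_covariance(a, b)
print(lhs, rhs)  # the two sides should agree
```

Because these two lists are positively correlated, the variance of the sums exceeds Var(a) + Var(b); the covariance term accounts for the difference.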
 
FactChecker said:
To be clear, by A+B, I assume you mean some set of data ##\{ c_i = a_i + b_i | a_i \in A,\ b_i \in B\}##.
In that case, the correlation between the ##a_i##s and associated ##b_i##s must be considered.
The general equation is Var(##X_A+Y_B##) = Var(##X_A##) + Var(##Y_B##) +2 Cov(##X_A,\ Y_B##).
For uncorrelated random variables, ##X_A## and ##Y_B##, this becomes Var(##X_A+Y_B##) = Var(##X_A##) + Var(##Y_B##)
Ah, I see. So basically this question does not really make sense, because the two groups have different numbers of data points, so in A + B some data in A would have no match in B.

If the question is modified to ask for the standard deviation when the data in A are combined with the data in B (so the total is now 250 data points), can we solve it? This is actually the version I tried and got stuck on (so I thought maybe the question does not give enough information).

Thanks
 
songoku said:
Ah, I see. So basically this question does not really make sense, because the two groups have different numbers of data points, so in A + B some data in A would have no match in B.

If the question is modified to ask for the standard deviation when the data in A are combined with the data in B (so the total is now 250 data points), can we solve it? This is actually the version I tried and got stuck on (so I thought maybe the question does not give enough information).

Thanks
It can be solved if we assume that the groups are taken from the same population and have the same mean.
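
A sketch of how that assumption leads to a number, using the equation from the problem statement rearranged as ##\Sigma x^2 = n\sigma^2 + n\mu^2##. The common mean ##\mu## is a symbol introduced here for illustration; it is not given in the question, and population (divide by ##n##) variances are assumed.
$$\Sigma a^2 = 150(100) + 150\mu^2, \qquad \Sigma b^2 = 100(400) + 100\mu^2, \qquad \Sigma a + \Sigma b = 250\mu$$
$$\sigma_c^2 = \frac{1}{250}\left(15000 + 150\mu^2 + 40000 + 100\mu^2 - \frac{(250\mu)^2}{250}\right) = \frac{55000}{250} = 220$$
$$\sigma_c = \sqrt{220} \approx 14.83$$
The unknown mean cancels, which is why this assumption is enough to produce a numerical answer.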
 
songoku said:
Ah, I see. So basically this question does not really make sense, because the two groups have different numbers of data points, so in A + B some data in A would have no match in B.
The first problem is that the meaning of "A+B" is undefined, or at least not clear to me. Do you mean the sum of random variables, ##X_A##, from A and ##X_B##, from B? In that case, you need to know which of the A samples match up and sum with which of the B samples.
songoku said:
If the question is modified to ask for the standard deviation when the data in A are combined with the data in B (so the total is now 250 data points), can we solve it? This is actually the version I tried and got stuck on (so I thought maybe the question does not give enough information).
So you are talking about drawing samples of a random variable, X, from the union of A and B, ##A \cup B##. Are the samples drawn uniformly at random from ##A \cup B##?
In that case, you should be able to use the standard equation for ##\sigma^2## that you gave above. Apply it to the entire 250 elements. Why do you say that it didn't work?
 
Maybe to clarify: are these samples from two populations, A and B, or do they describe the whole population of interest?
You may do some tests to determine whether the data come from different populations. I believe the Wilcoxon rank-sum test is one such non-parametric test.
 
What about the property ## \sigma_{A+B}^2 = \sigma_A^2 + \sigma_B^2 ## ?
 
Gavran said:
What about the property ## \sigma_{A+B}^2 = \sigma_A^2 + \sigma_B^2 ## ?
The OP defines A and B as sets. So A+B is not the sum of random variables. It is the sum of sets, whatever that means.
If you are talking about the sum of random variables, the formula is ##\sigma_{X+Y}^2 = \sigma_{X}^2 +\sigma_{Y}^2 + 2 cov(X,Y)##. Your "property" is wrong in general and only right for uncorrelated variables.
On the other hand, if you are talking about the union of sets, ##C=A\cup B##, with a random variable, ##X##, drawn with uniform distribution from ##C##, then it is still wrong. Consider the single-element sets ##A=\{0\}, B=\{100\}##. Clearly, ##\sigma_A = \sigma_B = 0## but ##\sigma_C = 50##.
 
FactChecker said:
The first problem is that the meaning of "A+B" is undefined, or at least not clear to me. Do you mean the sum of random variables, ##X_A##, from A and ##X_B##, from B? In that case, you need to know which of the A samples match up and sum with which of the B samples.

So you are talking about drawing samples of a random variable, X, from the union of A and B, ##A \cup B##. Are the samples drawn uniformly at random from ##A \cup B##?
In that case, you should be able to use the standard equation for ##\sigma^2## that you gave above. Apply it to the entire 250 elements. Why do you say that it didn't work?
I am not really sure how to interpret the question. I posted the exact question, word for word.

In my opinion, it makes more sense if the interpretation is not the sum of random variables but maybe the sum of sets. Group A has 150 data points with a standard deviation of 10, and group B has 100 data points with a standard deviation of 20. Let's say I combine all the data into one set, C, so this set contains 250 data points, and I want to find the standard deviation of C.

This is what I did:
For group A:
$$\sigma_{a}^{2}=\frac{1}{n_a} \left(\Sigma a^2 - \frac{(\Sigma a)^2}{n_a}\right)$$
$$100=\frac{1}{150} \left(\Sigma a^2 - \frac{(\Sigma a)^2}{150}\right)$$
$$\Sigma a^2=15000+\frac{(\Sigma a)^2}{150}....(1)$$

For group B:
$$\sigma_{b}^{2}=\frac{1}{n_b} \left(\Sigma b^2 - \frac{(\Sigma b)^2}{n_b}\right)$$
$$400=\frac{1}{100} \left(\Sigma b^2 - \frac{(\Sigma b)^2}{100}\right)$$
$$\Sigma b^2=40000+\frac{(\Sigma b)^2}{100}....(2)$$

For group C:
$$\sigma_{c}^{2}=\frac{1}{n_c} \left(\Sigma c^2 - \frac{(\Sigma c)^2}{n_c}\right)$$
$$=\frac{1}{250} \left(\Sigma a^2 +\Sigma b^2 - \frac{(\Sigma a+\Sigma b)^2}{250}\right)$$
$$=\frac{1}{250}\left(15000+\frac{(\Sigma a)^2}{150} + 40000+\frac{(\Sigma b)^2}{100} - \frac{(\Sigma a+\Sigma b)^2}{250}\right)$$

Then I got stuck.

Thanks
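
One possible way to carry the last line of the group C working a step further is to write the sums in terms of the group means, ##\Sigma a = 150\mu_a## and ##\Sigma b = 100\mu_b##. The means ##\mu_a## and ##\mu_b## are symbols introduced here as an assumption for illustration; the question does not give them.
$$\frac{(\Sigma a)^2}{150} + \frac{(\Sigma b)^2}{100} - \frac{(\Sigma a+\Sigma b)^2}{250} = 150\mu_a^2 + 100\mu_b^2 - \frac{(150\mu_a + 100\mu_b)^2}{250} = 60(\mu_a - \mu_b)^2$$
$$\sigma_c^2 = \frac{55000 + 60(\mu_a - \mu_b)^2}{250} = 220 + 0.24(\mu_a - \mu_b)^2$$
Written this way, the combined variance depends on the difference between the two group means, so it cannot be reduced to a number without that extra information.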
 
  • #10
Have you tried the approach of post #4, or do you find it unreasonable?
 
  • #11
Hill said:
Have you tried the approach of post #4, or do you find it unreasonable?
Oh, I did that and got ##\sqrt{220}## as the answer. I thought FactChecker was talking about something else, not using the assumption in post #4.

Thanks
 
  • #12
songoku said:
I am not really sure how to interpret the question. I posted the exact question, word for word.

In my opinion, it makes more sense if the interpretation is not the sum of random variables but maybe the sum of sets. Group A has 150 data points with a standard deviation of 10, and group B has 100 data points with a standard deviation of 20. Let's say I combine all the data into one set, C, so this set contains 250 data points, and I want to find the standard deviation of C.


For group C:
$$\sigma_{c}^{2}=\frac{1}{n_c} \left(\Sigma c^2 - \frac{(\Sigma c)^2}{n_c}\right)$$
$$=\frac{1}{250} \left(\Sigma a^2 +\Sigma b^2 - \frac{(\Sigma a+\Sigma b)^2}{250}\right)$$
$$=\frac{1}{250}\left(15000+\frac{(\Sigma a)^2}{150} + 40000+\frac{(\Sigma b)^2}{100} - \frac{(\Sigma a+\Sigma b)^2}{250}\right)$$

Then I got stuck.
You are not stuck. You are done.
I have not checked your arithmetic for group C, but that is the correct approach (given certain assumptions about what your question means).
If you want to consider the combined set C as the entire population of possible values of a random variable drawn uniformly from C, then you have calculated the variance of that random variable.
If you want to consider the combined set C as the set of sample results, then you should make one change to your equation. It should be ##\sigma_{c}^{2}=\frac{1}{n_c -1} \left(\Sigma c^2 - \frac{(\Sigma c)^2}{n_c}\right)##. The divisor is reduced by 1 because the population mean is being estimated.
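
The difference between the two divisors can be sketched in a few lines. The function `variance_from_sums` and the list `c` are hypothetical, introduced only to illustrate the sum-of-squares formula from the thread; the results are compared against Python's standard `statistics.pvariance` and `statistics.variance`.

```python
from statistics import pvariance, variance


def variance_from_sums(values, sample=False):
    """Variance via (sum(x^2) - (sum(x))^2 / n) / divisor.

    The divisor is n for a whole population and n - 1 for a sample.
    """
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(v * v for v in values)
    divisor = n - 1 if sample else n
    return (sum_x2 - sum_x ** 2 / n) / divisor


# Hypothetical combined data set standing in for C.
c = [2, 4, 4, 4, 5, 5, 7, 9]

print(variance_from_sums(c), pvariance(c))              # population version
print(variance_from_sums(c, sample=True), variance(c))  # sample version
```

Either version still needs the actual sums over the data, which is the information the original question does not supply.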


PS. When you combine two sets into one, IMO, you should use the union symbol, ##A \cup B##, rather than a plus sign.
 
  • #13
Oh OK, so that means I can't get a numerical answer.

Thank you very much for the help and explanation FactChecker, Hill, WWGD, Gavran
 
  • #14
songoku said:
Oh OK, so that means I can't get a numerical answer.

Thank you very much for the help and explanation FactChecker, Hill, WWGD, Gavran
Oh, wait! I thought that you had the values of the summations of all the elements in ##A \cup B##. Don't you have that? How did you get the means of A and B?
 
  • #15
FactChecker said:
Oh, wait! I thought that you had the values of the summations of all the elements in ##A \cup B##. Don't you have that? How did you get the means of A and B?
I posted the whole question in the OP; that's everything. I don't know the values of the summations of all the elements in ##A \cup B##, and I don't have the means of A and B.
 
  • #16
songoku said:
I posted the whole question in the OP; that's everything. I don't know the values of the summations of all the elements in ##A \cup B##, and I don't have the means of A and B.
Sorry, I misunderstood.

Interpreting A+B as ##A \cup B##:
There is no way to solve it. Consider three simpler problems, all with the same individual 0 (or undefined, if you wish) standard deviations for ##A## and ##B## but significantly different standard deviations for ##A \cup B##:
1) ##A=\{0\}, B=\{1\}##. ##\sigma_{\text{sample},\,A\cup B} = 0.70710678## and ##\sigma_{\text{population},\,A\cup B} = 0.5##
2) ##A=\{0\}, B=\{10\}##. ##\sigma_{\text{sample},\,A\cup B} = 7.0710678## and ##\sigma_{\text{population},\,A\cup B} = 5##
3) ##A=\{0\}, B=\{100\}##. ##\sigma_{\text{sample},\,A\cup B} = 70.710678## and ##\sigma_{\text{population},\,A\cup B} = 50##

If you don't like the 0 or undefined standard deviations for single-element sets A and B, you can easily make multiple-element examples.
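
One way such a multiple-element example could be built, as a sketch: the helper `two_point_group` is hypothetical and places half of a group at `center - sd` and half at `center + sd`, so its population standard deviation is exactly `sd`. The group sizes and standard deviations match the question, while the group means are free choices.

```python
from statistics import pstdev, pvariance


def two_point_group(size, center, sd):
    """Even-sized group with half its values at center - sd and half at
    center + sd, so the population standard deviation is exactly sd."""
    half = size // 2
    return [center - sd] * half + [center + sd] * half


def union_variance(mean_a, mean_b):
    """Population variance of the 250-element union for chosen group means."""
    a = two_point_group(150, mean_a, 10)  # 150 values, SD 10
    b = two_point_group(100, mean_b, 20)  # 100 values, SD 20
    return pvariance(a + b)


# Same sizes and standard deviations every time; only the gap between
# the group means changes.
for gap in (0, 10, 50):
    print(gap, union_variance(0, gap))
```

The printed variances differ from one gap to the next even though each group keeps its own standard deviation, which illustrates why the union interpretation has no single answer.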

Interpreting A+B as ##\{a+b| a\in A, b\in B, \text {selected independently and randomly}\}##:
Then apply ##\sigma_{A+B}^2 = \sigma_A^2 + \sigma_B^2## as @Gavran stated in post #7.
Since this is the only interpretation of A+B with a solution, it is probably the correct interpretation.
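
Under this interpretation the numbers in the question give ##\sigma_{A+B}^2 = 10^2 + 20^2 = 500##, so ##\sigma_{A+B} = \sqrt{500} \approx 22.36##. A small check, assuming hypothetical groups `a` and `b` built to have the stated sizes and population standard deviations: selecting a and b independently and uniformly makes every (a, b) pair equally likely, so the variance can be computed over all pairwise sums.

```python
from itertools import product
from statistics import pvariance

# Hypothetical groups with the sizes and population SDs from the question.
a = [-10] * 75 + [10] * 75  # 150 values, SD 10
b = [-20] * 50 + [20] * 50  # 100 values, SD 20

# Every pair (x, y) is equally likely under independent uniform selection.
sums = [x + y for x, y in product(a, b)]

var_sum = pvariance(sums)
sd_sum = var_sum ** 0.5
print(var_sum, sd_sum)
```

Shifting either group by a constant changes the mean of the sums but not their spread, so, unlike the union case, this result does not depend on the unknown group means.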
 
  • #17
I understand.

Thank you very much FactChecker
 
