Calculation of total standard deviation over samples of different size

In summary, the pooled standard deviation is calculated as: S(X1X2) = sqrt ( ( (n1-1)S²(X1) + (n2-1)S²(X2) ) / (n1 + n2 - 2) )
  • #1
Simon666
Hello,

The standard deviation is calculated as:

http://www.mathsrevision.net/gcse/sdeviation2.gif

Now the problem I have is: how do you calculate the standard deviation (more accurately?) over both samples, if you have two samples of different size, n1 and n2, in which the means µ1 and µ2 can change but the distribution otherwise remains the same?

Does it make mathematical sense to calculate an "overall" standard deviation using both samples, supposing more means better accuracy? And how would this overall standard deviation then be calculated?

[Edit] Anyway, Wikipedia seems to suggest that the pooled standard deviation is S(X1X2) = sqrt ( ( (n1-1)S²(X1) + (n2-1)S²(X2) ) / (n1 + n2 - 2) )

so, for example, with 3 samples of different size, is it then:

S(X1X2X3) = sqrt ( ( (n1-1)S²(X1) + (n2-1)S²(X2) + (n3-1)S²(X3) ) / (n1 + n2 + n3 - 3) ) ?
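That generalization can be sketched in a few lines of Python; `pooled_sd` and its argument format are my own illustration, not from the thread, and the function assumes every sample comes from a population with the same true spread:

```python
import math

def pooled_sd(stats):
    """Pooled standard deviation from (n_i, s_i) pairs, assuming all
    samples share a common population standard deviation."""
    num = sum((n - 1) * s**2 for n, s in stats)   # sum of (n_i - 1) s_i^2
    den = sum(n for n, _ in stats) - len(stats)   # sum of n_i, minus k
    return math.sqrt(num / den)

# Two samples with the same spread pool back to that spread:
print(pooled_sd([(10, 2.0), (20, 2.0)]))  # → 2.0
```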
 
  • #2
Are you asking about the notion of "pooling" standard deviations? This idea comes up in the following setting:

  1. You have two or more different samples, from populations for which the means may not all be the same
  2. You are willing to believe, or have evidence from somewhere, that the different populations have the same standard deviation

In other words, there may be differences in location among the populations but the variability is essentially the same.

If you perform a classical (normal distribution based) inference, it turns out that combining the individual data to calculate a single measure of variability results in tests and intervals that are preferable to those using the individual standard deviations or variances. The process is known as "pooling".

For two samples, the "pooled" variance is

[tex]
s^2_p = \frac{(n_1-1)s_1^2 + (n_2 -1) s_2^2}{n_1 + n_2 - 2}
[/tex]

Note that the two-sample confidence interval for the difference of two means in this case is

[tex]
\left(\overline x_1 -\overline x_2 \right) \pm t_{\frac{\alpha}2} \sqrt{\,s_p^2 \left(\frac 1 {n_1} + \frac 1 {n_2}\right)}
[/tex]

As you note, similar formulae exist for more than two samples.
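As an illustration of that interval (my own sketch, not from the posts), here is a short Python version; the helper name `pooled_ci` and the sample data are made up, and it assumes SciPy is available for the t quantile:

```python
import math
from scipy.stats import t

def pooled_ci(x1, x2, alpha=0.05):
    """Equal-variance t interval for mu1 - mu2 using the pooled variance."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    s1sq = sum((x - m1) ** 2 for x in x1) / (n1 - 1)
    s2sq = sum((x - m2) ** 2 for x in x2) / (n2 - 1)
    sp2 = ((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2)  # pooled variance
    half = t.ppf(1 - alpha / 2, n1 + n2 - 2) * math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return m1 - m2 - half, m1 - m2 + half

lo, hi = pooled_ci([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
```

The interval is centered on the difference of the sample means, with half-width driven by the pooled variance and both sample sizes.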
 
  • #3
[tex]
s^2_p = \frac{n_1 s_1^2 + n_2 s_2^2 + n_1(\mu_1-\mu)^2 + n_2(\mu_2-\mu)^2}{n_1 + n_2}
[/tex]
where mu1 and mu2 are the respective sample means and mu is the pooled mean.

The previous post is not true in general. It is the particular case for testing the difference of means of two univariate normal distributions, when the two samples are independent and the population standard deviations are unknown but assumed to be equal. The test statistic is Fisher's t.
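A quick numerical check (my own sketch, with made-up group data) that this expression, with each s_i² computed with divisor n_i, reproduces the descriptive variance of the two groups merged into one:

```python
def var_n(xs):
    """Descriptive variance with divisor n (not n - 1)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

g1, g2 = [1.0, 2.0, 3.0], [10.0, 12.0, 14.0, 16.0]
n1, n2 = len(g1), len(g2)
m1, m2 = sum(g1) / n1, sum(g2) / n2
m = (n1 * m1 + n2 * m2) / (n1 + n2)  # pooled mean
lhs = (n1 * var_n(g1) + n2 * var_n(g2)
       + n1 * (m1 - m) ** 2 + n2 * (m2 - m) ** 2) / (n1 + n2)
rhs = var_n(g1 + g2)  # treat all the data as one sample
print(abs(lhs - rhs) < 1e-9)  # → True
```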
 
  • #4
The comment about my CI is correct, and the particular instance to which it applies is the only situation where two scalar standard deviations are pooled with that formula (the situation: comparison of two means, standard deviations assumed equal). But this

[tex]
s^2_p = \frac{n_1 s_1^2 + n_2 s_2^2 + n_1(\mu_1-\mu)^2 + n_2(\mu_2-\mu)^2}{n_1 + n_2}
[/tex]
?

What is it?
 
  • #5
statdad said:
The comment about my CI is correct, and the particular instance to which it applies is the only situation where two scalar standard deviations are pooled with that formula (the situation: comparison of two means, standard deviations assumed equal). But this

[tex]
s^2_p = \frac{n_1 s_1^2 + n_2 s_2^2 + n_1(\mu_1-\mu)^2 + n_2(\mu_2-\mu)^2}{n_1 + n_2}
[/tex]
?

What is it?

This is nothing but the square of the pooled s.d. of two groups of observations where,
for the i-th group (i = 1, 2):
the mean is mu_i,
the s.d. is s_i,
the number of observations is n_i,
and mu is the pooled mean.

I don't understand why the question of a particular "situation", where the samples are independent, arises at all.

statdad said:
The comment about my CI is correct,
NO, NOT IN GENERAL. Consider the case when the two population means are known to be "not equal".
 
  • #6
The standard deviations being unequal was the situation being discussed.
I still don't know where your formula comes from - why the need for the squares of the difference in means?
 
  • #7
Are you a statistician?
Is the unequal population s.d. case what you were talking about, for the t-test statistic and/or CI? Then you are wrong. The formula you gave holds only under the assumption of EQUAL population s.d.s.

statdad said:
I still don't know where your formula comes from - why the need for the squares of the difference in means?

Simply, from the definition of the standard deviation. It is worked out in many high-school-level textbooks of descriptive statistics.
 
  • #8
I have a feeling I know far more statistics than you, but I could be wrong. I would be more than willing to compare backgrounds, but that is not what this forum is for. I still say your formula is not the formula for the pooled variance in the t-test/confidence interval being discussed. Nor is it the formula for the variance of a single sample. Where did you find it? Do you have a reference?
 
  • #9
statdad said:
I have a feeling I know far more statistics than you, but I could be wrong. I would be more than willing to compare backgrounds, but that is not what this forum is for. I still say your formula is not the formula for the pooled variance in the t-test/confidence interval being discussed. Nor is it the formula for the variance of a single sample. Where did you find it? Do you have a reference?

I am sorry if my words hurt you. I simply thought you were from another field.
The reference you are looking for: any standard textbook at the 10+2 (senior secondary) level.

I NEVER said that the expression I gave has anything to do with the t-test. Please don't superimpose your wrong thoughts over my words.

I don't understand where the t statistic is coming from. Please have a look at the OP's question (and the thread title as well). This is a situation of descriptive statistics (as I mentioned earlier). The basic question was about the expression for the pooled s.d. of two groups of observations. In searching for its answer, he found and posted a link about the t statistic, which is quite irrelevant to the answer he is looking for.

Where did I get the expression: I worked it out for the first time 25 years ago, and I have been proving/teaching it regularly to date. (And no one has disproved it up to 9th Nov, 2008.)

If you cannot work it out from the definition of the s.d. (or s.d.²), then for your information I am attaching the elementary calculations worked out in a standard text. I hope you will be able to understand them.

Please note: the symbols for the means and the pooled s.d. in the print differ from what I wrote (they used x_bar where I used mu, and for the pooled s.d. they wrote s where I used s_p, but I think you will agree this makes no difference to the mathematics).


[attached scan of the textbook derivation: 2qitstu.jpg]


PS: About the non-equality of the population s.d.s:
statdad said:
The standard deviations being unequal was the situation being discussed.
You claimed that you were talking about the case of unequal s.d.s! Have you heard of the "Fisher-Behrens problem"? Under what condition is it applicable? Does your CI expression hold then?
 

  • #10
Hi ssd,

I have been searching various forums, posts and references trying to find a way of combining standard deviations. I think the method you quote is what I'm looking for, so could you please tell me who the author is of the book (Fundamentals of Statistics) you quoted earlier?

Thanks
 
  • #11
The authors are A.M. Goon, M.K. Gupta and B. Dasgupta (all from India).
Publisher: The World Press Private Ltd.
Info: more than 7 editions had appeared before 1990.
 
  • #12
Hi ssd,

Thank you for scanning the page. I have just one question, and I hope you can enlighten me. :)

The equation in the middle of the page gives:

sum_j (x_1j - x_bar)^2 = sum_j {(x_1j - x1_bar) + (x1_bar - x_bar)}^2

why is it equal to

sum_j (x_1j - x1_bar)^2 + n1 (x1_bar - x_bar)^2 ?

Isn't there a part missing:

sum_j 2 (x_1j - x1_bar)(x1_bar - x_bar) ?

Please help me to understand this.
Thanks
 
  • #13
[tex] \sum_j (x_{1j} - \bar x )^2 = \sum_j ( ( x_{1j} - \bar x_1 ) + (\bar x_1 - \bar x ))^2 [/tex]
[tex] = \sum_j ( x_{1j} - \bar x_1) ^2 + \sum_j (\bar x_1 - \bar x )^2 + 2 \sum_j (( x_{1j} - \bar x_1)(\bar x_1 - \bar x )) [/tex]
[tex] = \sum_j ( x_{1j} - \bar x_1 )^2 + \sum_j (\bar x_1 - \bar x )^2 + 2 (\bar x_1 - \bar x ) \sum_j( x_{1j} - \bar x_1 ) [/tex]
[tex] = \sum_j ( x_{1j} - \bar x_1 )^2 + \sum_j (\bar x_1 - \bar x )^2 + 2 (\bar x_1 - \bar x ) \left( \sum_j x_{1j} - \sum_j \bar x_1 \right) [/tex]
[tex] = \sum_j ( x_{1j} - \bar x_1 )^2 + \sum_j (\bar x_1 - \bar x )^2 + 2 (\bar x_1 - \bar x ) (n_1 \bar x_1 - n_1 \bar x_1 ) [/tex]
[tex] = \sum_j ( x_{1j} - \bar x_1 )^2 + \sum_j (\bar x_1 - \bar x )^2 + 2 (\bar x_1 - \bar x ) (0) [/tex]
[tex] = \sum_j ( x_{1j} - \bar x_1 )^2 + \sum_j (\bar x_1 - \bar x )^2 [/tex]
[tex] = \sum_j ( x_{1j} - \bar x_1 )^2 + n_1 (\bar x_1 - \bar x )^2 [/tex]
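The cross-term cancellation in the derivation above can also be checked numerically. This small Python sketch (mine, with made-up numbers) verifies the resulting identity for one group against an arbitrary overall mean:

```python
g1 = [2.0, 4.0, 9.0]     # one group of observations (made up)
xbar = 3.0               # a hypothetical overall (pooled) mean
m1 = sum(g1) / len(g1)   # the group's own mean

# sum of squared deviations about xbar ...
lhs = sum((x - xbar) ** 2 for x in g1)
# ... equals within-group sum of squares plus n1 * (shift of means)^2
rhs = sum((x - m1) ** 2 for x in g1) + len(g1) * (m1 - xbar) ** 2
print(abs(lhs - rhs) < 1e-12)  # → True
```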

As I understand the scanned text, if you want to compute the pooled sample variance of two samples, you can simply treat all the data as one sample and compute it that way. In this age of computers, that might be the simplest thing to do. The formula for the pooled variance gives exactly the same result.

As pointed out in previous posts, there is a difference between the variance of a sample (pooled or otherwise) and a computation using sample values that is proposed as an estimator of the population variance (or of the common variance of two different populations). The formula for the sample variance may or may not be the best estimator for the population variance. Whether it is, depends on how you define "best" and what information is known about the distribution of the population.
 
  • #14
Hi Stephen Tashi,

Thank you very much for the detailed explanation. Now it is clear to me why the term disappeared. :)

As for your comments, I want to make sure I understand them correctly, so I will try to describe the problem in my own words; please tell me whether it makes sense.

Here I have, for example, 5 groups of data, each with ni data points (i = 1..5). For each group I have already calculated the mean and std using:

xi_bar = Ʃ xi / ni
σi^2 = Ʃ (xi - xi_bar)^2 / (ni - 1), where ni is the number of data points in group i.

So if I want to pool the 5 groups together and calculate the total mean and std, I can either use the equations above on all Ʃni data points, or I can use the equation for the pooled std, and then the mean will be

x_bar = Ʃ(xi_bar*ni)/(Ʃni - 1), right?

The results from two methods are the same?

I do not quite understand what you mean by 'the best estimator' in your last paragraph.
Can you again spend some time to explain this to me, thanks.
 
  • #15
The mean and variance of a sample are given by formulas with standard definitions. The field of study that states these definitions is "descriptive statistics". If you give numbers for the mean and variance of a sample, people will assume you obeyed these formulas, or used formulas which give exactly the same numerical answers. It's merely a matter of obeying standard conventions.

When you want to use the numbers in a sample to estimate the mean and variance of a population (or a "random variable") there are no set rules for what formula you can use. What you do will depend on what you know about the distribution of the population.

There are three different concepts involved:
1) The properties of the sample ( such as its mean and variance)
2) The properties of the population (such as its mean and variance)
3) The formulae or procedures that you apply to the data in the sample to estimate the properties of the population.

For example, suppose the population is defined by a random variable X that has a discrete distribution with two unknown parameters M and A. Suppose we know that X has only 3 values with non-zero probabilities and that these are given by:

probability that X = M + A is 1/3
probability that X = M - A is 1/3
probability that X = M is 1/3

Suppose we take a sample of 4 random draws from this distribution and the results are:
{ -3, 1, 5, 5 }. Then we know "by inspection" that M = 1 and A = 4. The mean of the population is therefore 1. (There is a standard definition for the mean of a distribution and if you apply it to the above list of probabilities, using M = 1 and A = 4, you get that the mean is 1.)

However, if you state that you have computed the mean of the sample, this tells people that you are stating the number (-3 + 1 + 5 + 5)/4 = 2. You aren't supposed to say that the mean of the sample is 1, even though you know that the sample implies that the mean of the population is 1.

Suppose you have a sample of N values of the random variable X and let the sample mean be [itex] \bar x [/itex]. I'm not an expert in descriptive statistics, but I think that if you state a number for the sample variance, it is always supposed to be the number:

[tex] \frac {\sum (x - \bar x)^2}{N} [/tex]

and not the number:

[tex] \frac {\sum (x - \bar x)^2}{N-1} [/tex]

If you are estimating the variance of the population, you are free to use the latter formula, and people advocate doing this when N is "small". To understand why, you have to study the statistical theory of "estimators".
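Incidentally, Python's standard library exposes exactly these two conventions as `statistics.pvariance` (divide by N) and `statistics.variance` (divide by N - 1); the data below are made up for illustration:

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
m = sum(data) / n
desc = sum((x - m) ** 2 for x in data) / n        # divide by N (descriptive)
est = sum((x - m) ** 2 for x in data) / (n - 1)   # divide by N - 1 (estimator)
print(desc == statistics.pvariance(data))  # → True
print(est == statistics.variance(data))    # → True
```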

--------------------

So if I want to pool 5 groups together and calculate the total mean and std, I can either use the equation above on all data points Ʃni, or I can use the equation for pooled std and the mean will be

x_bar = Ʃ(xi_bar*ni)/(Ʃni-1), right?

No. You wouldn't divide by [itex] \sum n_i - 1. [/itex] Divide by [itex] \sum n_i [/itex].
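A one-line sanity check (my own sketch, with made-up groups) that weighting each group mean by its group size and dividing by the total count reproduces the plain mean of all the data:

```python
groups = [[1.0, 3.0], [5.0, 7.0, 9.0]]  # made-up data
ns = [len(g) for g in groups]
means = [sum(g) / len(g) for g in groups]
# Weight each group mean by its size and divide by the TOTAL count, not total - 1:
pooled_mean = sum(n * m for n, m in zip(ns, means)) / sum(ns)
flat = [x for g in groups for x in g]
print(pooled_mean == sum(flat) / len(flat))  # → True
```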
 

What is the purpose of calculating the total standard deviation over samples of different size?

The purpose of calculating the total standard deviation over samples of different size is to measure the variability or spread of a set of data points. It allows us to understand how much the data points deviate from the mean and how dispersed they are from each other.

How is the total standard deviation over samples of different size calculated?

The total standard deviation over samples of different size is calculated by taking the square root of the sum of the squared differences between each data point and the overall mean, divided by the total number of data points (the descriptive convention; dividing instead by the total minus the number of groups gives the pooled, unbiased version discussed in this thread). The first form is also known as the root mean square deviation.

What are the limitations of using total standard deviation over samples of different size?

One limitation is that interpreting the standard deviation (for example, via the 68-95-99.7 rule) assumes the data follow a roughly normal distribution. If the data are skewed or have outliers, the standard deviation may not represent the variability well. Additionally, the reliability of the estimate depends on sample size: larger samples give a more stable estimate (and a smaller standard error of the mean), even when the underlying variability is the same.

What is the difference between total standard deviation and sample standard deviation?

The total (population) standard deviation divides by the number of data points N, while the sample standard deviation divides by N - 1 to correct the bias introduced by estimating the mean from the same data. The sample version therefore gives a slightly larger value, especially for small samples.

How can total standard deviation over samples of different size be used in scientific research?

Total standard deviation can be used to compare the variability of different sets of data, identify outliers, and make predictions about future data. It can also be used to determine the confidence interval for a set of data, which is important for making statistical inferences in scientific research.
