# How to calculate Standard Error for unequal sample sizes

1. Jun 9, 2015

### RichS

I've been given a set of samples, each has different sample size and mean (but not individual observations). I'm trying to figure out the population standard deviation so that I can estimate required sample size for certain confidence intervals.

My question is how do I do this? The standard textbook formula is:
StDev of population = StDev of means (standard error) * sqrt(sample size)

The problem is that this formula assumes equal sample sizes. In my case each sample size is different. How do I do this?

Someone suggested I look into pooled variance, and intuitively I'd think it should be a form of weighted average. So would the Satterthwaite approximation give me the standard error that I'm looking for? Even if it does, what "sample size" should I put in the above formula [StDev of population = StDev of means (standard error) * sqrt(sample size)]?

Many thanks,
Rich

2. Jun 9, 2015

### Staff: Mentor

What do you mean by "(but not individual observations)"?
The formula applies to every sample considered individually, independent of its size.

3. Jun 9, 2015

### mathman

Essentially, convert the mean and standard deviation of each sample to first and second moment sums:
$s_1=n\mu,\ s_2=n(\sigma^2+\mu^2)$, where $\sigma$ = standard deviation, $\mu$ = mean, and $n$ = sample size. Do this for every sample, then add up the sums across samples to get the moment sums for the combined data, from which you can get the mean and variance ($\sigma^2$) of the total.
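
A minimal Python sketch of this moment-sum approach, assuming each sample reports its size, mean, and population (divide-by-n) standard deviation, so that the second moment sum is $n(\sigma^2+\mu^2)$. The function name `combine_groups` and the two example groups are made up for illustration.

```python
import math

def combine_groups(sizes, means, stdevs):
    """Pool per-sample (n, mean, population stdev) into an overall mean and stdev."""
    n_total = sum(sizes)
    # First moment sum: n * mu for each sample, added over all samples.
    s1 = sum(n * mu for n, mu in zip(sizes, means))
    # Second moment sum: n * (sigma^2 + mu^2) for each sample.
    s2 = sum(n * (sd ** 2 + mu ** 2) for n, mu, sd in zip(sizes, means, stdevs))
    mean = s1 / n_total
    variance = s2 / n_total - mean ** 2
    # Guard against tiny negative values caused by floating-point rounding.
    return mean, math.sqrt(max(variance, 0.0))

# Hypothetical check: the groups [1, 3] and [5, 7, 9] summarised by n, mean, stdev.
mean, stdev = combine_groups([2, 3], [2.0, 7.0], [1.0, math.sqrt(8 / 3)])
print(mean, stdev)
```

The result should agree with the mean and population standard deviation of the pooled values 1, 3, 5, 7, 9 up to rounding, which is an easy way to sanity-check the formula.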

4. Jun 9, 2015

### RichS

Thank you both for your swift responses. Very much appreciated. I probably didn't explain clearly. Here's a made-up example of my data (sorry I don't have the real one with me right now):

Sample Means (μ): 263, 343, 445, 655, 233, 324
Sample Size (n) : 34, 5, 76, 23, 43, 45

There's no other information, i.e. all I get is this. It's all aggregate information, definitely no individual data. Because of privacy concerns, the data provider will never give me the individual data (mfb, this is what I meant by "(but not individual observations)").

Because sample size is different each time, I can't apply the formula: StDev of population = StDev of means (standard error) * sqrt(sample size)

Hi mathman, thanks for your formula, but I don't even have the standard deviation of each sample, so is there another solution?

Basically I need to estimate the sample size required for confidence intervals but can't find a formula. I realised that the Satterthwaite approximation won't help me because it requires the stdev of each sample.

Thanks again!

5. Jun 10, 2015

### Staff: Mentor

You assume all samples come from the same distribution? Same mean, same "true" standard deviation?
Then your best estimate for the mean is a weighted mean of the sample means, where the weights are the sample sizes: the mean is the sum over (sample size)*(sample mean), divided by the sum of sample sizes.
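
As a small sketch of that weighted mean, here it is in Python applied to the made-up numbers from post 4. The helper name `weighted_mean` is only illustrative.

```python
# Made-up example data from earlier in the thread.
sample_means = [263, 343, 445, 655, 233, 324]
sample_sizes = [34, 5, 76, 23, 43, 45]

def weighted_mean(means, sizes):
    """Mean of the sample means, weighted by sample size."""
    return sum(n * a for n, a in zip(sizes, means)) / sum(sizes)

overall = weighted_mean(sample_means, sample_sizes)
print(round(overall, 1))  # about 372.3 (84141 / 226)
```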

The estimate for the standard deviation can be done in a similar way: sum sqrt(sample size)*abs(sample mean - total mean), divide by the square root of the sum of sample sizes.
Not completely sure this is right, but I tested it and it seems to work.

Whatever you implement, run tests with samples with a known distribution to verify the result is unbiased.

6. Jun 10, 2015

### mathman

If you don't have standard deviation for the samples, there is no way to get the standard deviation for the total. You can get the overall mean by averaging the means, weighted by the sample sizes.

7. Jun 10, 2015

### Staff: Mentor

You can get an estimate based on the differences between the samples, assuming every (unavailable) data point comes from the same distribution.

8. Jun 11, 2015

### RichS

Thank you mfb and mathman.

mfb, I think your formula (sum sqrt(sample size)*abs(sample mean - total mean), divide by the square root of the sum of sample sizes) is close to what I've been looking for. I wanted to find a way to get the weighted average of the standard error based on unequal sample sizes but couldn't get my head around it. Actually, this is not too hard to test in a spreadsheet. I'll do that when I have time.

Thanks very much to both of you again!
Rich

9. Jun 25, 2015

### RichS

Hi mfb,

I honestly thought your formula made a lot of sense. However, when I ran some tests in a spreadsheet, it tended to underestimate the standard deviation. Most of the time it's 60-80% of the "true" stdev, but occasionally it's only 20%. I'm really puzzled as to why this is. Could you help me please? Is there a way to reduce this error?

Here's what I did in the spreadsheet: I let it generate 1000 random numbers ranging from 0 to 400, then arbitrarily divided these 1000 numbers into 8 groups, each with a different sample size ranging from 40 to 300. Then I estimated the stdev and compared it with the true stdev of these 1000 numbers. I know I used random numbers, which violates your assumption that all samples come from the same distribution, with the same mean and the same "true" standard deviation. I used this because the samples could actually have different means and "true" standard deviations. Is there a way I can adjust for this?

Maybe I should do a test that meets your assumptions.

Sorry to bother you again and thanks very much.

Rich

10. Jun 25, 2015

### Staff: Mentor

Yeah, I'm not sure where the problem is. I tested it with 1000*5 numbers (grouped as 3+2) in a spreadsheet and it worked, then I ran more tests with Python and it did not, even with the same groups: with sufficient data it underestimated the deviation by ~25% on average for a large class of different group numbers and sizes, but not for all.
Probably needs a detailed calculation to find the formula for the best estimate.

If the different subsamples can have different means and standard deviations, you are lost. There is no way to draw conclusions then. But if your 1000 random numbers were all from the same distribution, that does not happen.

11. Jun 25, 2015

### RichS

12. Jun 25, 2015

### RichS

I think I found the answer. It's not what I said. I'll post it after delivering my results, which are due in a few hours' time.

13. Jun 26, 2015

### RichS

Too bad, I thought I had solved the problem but it's actually getting worse. In another little experiment I did, the estimated stdev was 300% of the 'true' stdev. I'm still struggling to understand this. Does anyone have any ideas, please?

14. Jun 26, 2015

### Staff: Mentor

That is a test of whether different datasets are compatible.

The problem is interesting enough to do it the long way:
Let $N$ be the total sample size (sum of all subsets). There are $I$ subsets, where $I>1$ to make the problem meaningful. All sums and products always run over those subsets.
Let $N_i$ be the size of subset $i$, and let $A_i$ be the observed average in this subset. Let $A$ be the total observed average, $A=\frac{1}{N} \sum N_i A_i$.
Assume that every data point in the sample follows a Gaussian distribution with (true) mean $m$ and standard deviation $\sigma$.
The distribution of $A_i$ will then follow a Gaussian with mean $m$ and standard deviation $\frac{\sigma}{\sqrt{N_i}}$.

The total likelihood to observe the set $\{A_i\}$ is
$$LH=\prod \frac{\sqrt{N_i}}{\sqrt{2\pi}\sigma} \exp\left( \frac{-N_i (A_i-m)^2}{2\sigma^2} \right)$$
The best estimate for $\sigma$ and m is a set that maximizes this likelihood.
Let's calculate $-LLH=-\log(LH)$ because this is easier to analyze:
$$-LLH=c+\sum \left( \frac{N_i(A_i-m)^2}{2\sigma^2} + \log(\sigma)\right)$$
Where c is some constant coming from the constant prefactors. Simplify:
$$-LLH=c+I \log(\sigma) + \frac{1}{2\sigma^2} \sum N_i(A_i-m)^2$$
Calculate the derivative with respect to m:
$$\frac{d (-LLH)}{dm} = \frac{-1}{2\sigma^2}\sum 2N_i (A_i-m)$$
Setting it to zero we get $\sum N_i (A_i-m)=0$ or $\sum N_i A_i = Nm$ which is satisfied for m=A. Not surprising: using the observed average is the best estimate for the true average. We can plug it into the -LLH and calculate the derivative with respect to $\sigma$:
$$\frac{d (-LLH)}{d\sigma} = \frac{I}{\sigma} - \frac{1}{\sigma^3} \sum N_i(A_i-m)^2$$
Set it to zero again and after simplification we get:
$$\hat \sigma = \sqrt{\frac{1}{I} \sum N_i(A_i-m)^2}$$
This should be the best estimate (denoted by the hat) for the standard deviation in your original sample, given the averages in the subsamples and their sizes.
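
A minimal Python sketch of this estimate, assuming independent subsamples drawn from one common distribution. `sigma_hat` is a hypothetical helper name, and the two-group input is a toy example chosen so the result can be checked by hand.

```python
import math

def sigma_hat(means, sizes):
    """Maximum-likelihood estimate of sigma from subsample means and sizes."""
    total = sum(sizes)
    # Overall average A: size-weighted mean of the subsample averages.
    grand_mean = sum(n * a for n, a in zip(sizes, means)) / total
    # Sum of N_i * (A_i - A)^2 over the subsets.
    weighted_ss = sum(n * (a - grand_mean) ** 2 for n, a in zip(sizes, means))
    # Divide by the number of subsets I, as in the formula above.
    return math.sqrt(weighted_ss / len(sizes))

# Toy check: two subsets of size 4 with averages 1 and 3.
# A = 2, sum N_i (A_i - A)^2 = 8, divided by I = 2 gives 4, square root is 2.
print(sigma_hat([1.0, 3.0], [4, 4]))  # 2.0
```

Because the formula divides by $I$, it tends to come out slightly low on average; dividing by $I-1$ instead makes the variance estimate unbiased under the same assumptions.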

It is also possible to evaluate the second derivative of the log likelihood at that point to get an estimate on the uncertainty of this value. I get
$$\Delta \hat \sigma = \frac{\hat \sigma}{\sqrt{2 I}}$$
up to some prefactor of 2 or similar that might be missing.
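
For reference, a sketch of that second-derivative step, assuming the usual approximation in which the variance of the estimate is the inverse of the second derivative of $-LLH$ at its minimum:
$$\frac{d^2 (-LLH)}{d\sigma^2} = -\frac{I}{\sigma^2} + \frac{3}{\sigma^4} \sum N_i(A_i-m)^2$$
At $\sigma=\hat\sigma$ the sum equals $I \hat\sigma^2$, so
$$\left.\frac{d^2 (-LLH)}{d\sigma^2}\right|_{\hat\sigma} = -\frac{I}{\hat\sigma^2} + \frac{3I}{\hat\sigma^2} = \frac{2I}{\hat\sigma^2}, \qquad \Delta \hat \sigma \approx \left(\frac{2I}{\hat\sigma^2}\right)^{-1/2} = \frac{\hat \sigma}{\sqrt{2 I}}$$
With only $I=6$ subsets, for example, this is a relative uncertainty of about 29%.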

15. Jun 29, 2015

### RichS

Thanks very much, mfb. You've been very helpful.

I did some more testing on your formula. Sometimes it's good, but sometimes it still produces a relatively large difference from the 'true' standard deviation. I think that's unavoidable, as with many other estimates.

I also did some tests using the smallest sample size in the standard error formula and it also seems reasonable, i.e. σ = (σ of the means) * √min(n). It's interesting that min(n) gives a better answer than avg(n).
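
The size of that scatter can be checked with a small simulation. The following is only a sketch under stated assumptions: it mimics the spreadsheet test described earlier (1000 uniform random numbers between 0 and 400, split into 8 groups of unequal size), the group sizes and the seed are arbitrary choices, and `sigma_hat` and `one_trial` are hypothetical helper names.

```python
import math
import random
import statistics

def sigma_hat(means, sizes):
    """Estimate sigma from subsample means and sizes (divide-by-I form)."""
    grand_mean = sum(n * a for n, a in zip(sizes, means)) / sum(sizes)
    weighted_ss = sum(n * (a - grand_mean) ** 2 for n, a in zip(sizes, means))
    return math.sqrt(weighted_ss / len(sizes))

def one_trial(rng, sizes):
    """Ratio of the estimate to the actual stdev of one simulated data set."""
    data = [rng.uniform(0, 400) for _ in range(sum(sizes))]
    means, start = [], 0
    for n in sizes:
        group = data[start:start + n]
        means.append(sum(group) / n)
        start += n
    return sigma_hat(means, sizes) / statistics.pstdev(data)

rng = random.Random(0)  # fixed seed so the run is repeatable
sizes = [40, 60, 80, 100, 120, 140, 160, 300]  # 8 groups, 1000 points in total
ratios = [one_trial(rng, sizes) for _ in range(500)]

mean_ratio = sum(ratios) / len(ratios)
spread = statistics.stdev(ratios)
print(f"mean ratio {mean_ratio:.2f}, spread {spread:.2f}")
```

With 8 groups, the relative uncertainty $1/\sqrt{2I}$ quoted above is 25%, so individual estimates that land well away from the true value are to be expected even when every assumption holds.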

Thanks again for your generous help!

Rich

16. Jun 30, 2015

### Josh S Thompson

Why don't you just do a weighted average with that formula:
Sum[(StDev of means (standard error))_i * sqrt((sample size)_i / total sample size)], i = 1...5

If the samples are comparable, this should work.
It is also what I would put down if this were my homework problem.

17. Jun 30, 2015

### Staff: Mentor

Well, you can be unlucky. No estimate can avoid that.

That certainly breaks down if min(n) is small compared to the sizes of the other samples, and I don't see why it should be better anywhere.

@Josh: That does not work.

18. Jun 30, 2015

### Josh S Thompson

Are all the samples the same experiment?

Idk what level you guys are on, but you can't solve this problem: there is no way to know the true covariance, is there?

19. Jun 30, 2015

### Staff: Mentor

I hope the different subsamples are not correlated. If they are: yes, then we are lost without information on how they are correlated.

20. Jun 30, 2015

### Josh S Thompson

Yeah, idk, you've got to tell me what the samples are. I think if you're experimenting you do independent experiments, so the distributions are independent across variables.

I thought more samples reduce the standard error, so why do your formulas say otherwise?

mfb, is your formula some kind of transform? How did you do that, if you don't mind explaining?