How to calculate Standard Error for unequal sample sizes

RichS

I've been given a set of samples, each has different sample size and mean (but not individual observations). I'm trying to figure out the population standard deviation so that I can estimate required sample size for certain confidence intervals.

My question is how do I do this? The standard textbook formula is:
StDev of population = StDev of means (standard error) * sqrt(sample size)

The problem is that this formula applies to equal sample size. In my case each sample size is different. How do I do this?

Someone suggested I look into pooled variance, and intuitively I'd think it should be a form of weighted average. So would the Satterthwaite Approximation give me the standard error that I'm looking for? Even if it does, what "sample size" should I put in the above formula [StDev of population = StDev of means (standard error) * sqrt(sample size)]?

Many thanks,
Rich

mfb
What do you mean by "(but not individual observations)"?
RichS said:
The problem is that this formula applies to equal sample size.
It applies to every sample considered individually, independent of its size.

Josh S Thompson
Essentially, convert the mean and standard deviation to first and second moment sums:
$s_1=n\mu,\ s_2=n(\sigma^2+\mu^2)$. Do this for both samples, where $\sigma$ = standard deviation, $\mu$ = mean, and $n$ = sample size. Now add up the sums from each sample to get the moment sums for the two together, from which you can get the mean and variance ($\sigma^2$) for the total.
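As a sketch of this moment-sum bookkeeping (assuming each subsample's σ is the population-style, ddof=0 standard deviation; the function name is illustrative):

```python
import math

def combine(samples):
    """Combine (n, mean, sd) triples into the overall mean and sd
    by adding up first and second moment sums."""
    n_tot = s1 = s2 = 0.0
    for n, mu, sigma in samples:
        n_tot += n
        s1 += n * mu                    # first moment sum:  sum of x
        s2 += n * (sigma**2 + mu**2)    # second moment sum: sum of x^2
    mean = s1 / n_tot
    var = s2 / n_tot - mean**2          # total (ddof=0) variance
    return mean, math.sqrt(var)
```

For example, splitting [1, 2, 3, 4, 5] into [1, 2, 3] and [4, 5] and recombining reproduces the full-sample mean 3 and standard deviation √2.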

Thank you both for your swift responses. Very much appreciated. I probably didn't explain clearly. Here's a made-up example of my data (sorry I don't have the real one with me right now):

Sample Means (μ): 263, 343, 445, 655, 233, 324
Sample Size (n) : 34, 5, 76, 23, 43, 45

There's no other information, i.e. all I get is this. It's all aggregate information, definitely no individual data. Because of privacy concerns, the data provider will never give me the individual data (mfb, this is what I meant by "(but not individual observations)").

Because sample size is different each time, I can't apply the formula: StDev of population = StDev of means (standard error) * sqrt(sample size)

Hi Mathman, thanks for your formula, but I don't even get the standard deviation for each sample, so is there another solution?

Basically I need to estimate the sample size required for confidence intervals but can't find a formula. I realized that Satterthwaite Approximation won't help me because it requires Stdev of each sample.

Thanks again!

You assume all samples come from the same distribution? Same mean, same "true" standard deviation?
Then your best estimate for the mean is a weighted mean of the sample means, where the weights are the sample sizes: the mean is the sum over (sample size)*(sample mean), divided by the sum of sample sizes.

The estimate for the standard deviation can be done in a similar way: sum sqrt(sample size)*abs(sample mean - total mean), divide by the square root of the sum of sample sizes.
Not completely sure this is right, but I tested it and it seems to work.
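For testing, both recipes above transcribe directly into a short script (a sketch; the function names are mine, and as the thread goes on to find, the spread estimate is a heuristic that can be biased):

```python
import math

def weighted_mean(sizes, means):
    """Best estimate of the mean: sample means weighted by sample sizes."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)

def spread_estimate(sizes, means):
    """Heuristic spread estimate described above: the sum of
    sqrt(n_i) * |mean_i - total mean|, divided by sqrt(sum of n_i)."""
    total = weighted_mean(sizes, means)
    num = sum(math.sqrt(n) * abs(m - total) for n, m in zip(sizes, means))
    return num / math.sqrt(sum(sizes))
```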

Whatever you implement, run tests with samples with a known distribution to verify the result is unbiased.

If you don't have standard deviation for the samples, there is no way to get the standard deviation for the total. You can get the overall mean by averaging the means, weighted by the sample sizes.

You can get an estimate based on the differences between the samples, assuming every (unavailable) data point comes from the same distribution.

Thank you mfb and mathman.

mfb, I think your formula (sum sqrt(sample size)*abs(sample mean - total mean), divided by the square root of the sum of sample sizes) is close to what I've been looking for. I wanted to find a way to get a weighted average of the standard error based on unequal sample sizes but couldn't get my head around it. Actually, this is not too hard to test in a spreadsheet. I'll do that when I have time.

Thanks very much to both of you again!
Rich

Hi mfb,

I honestly thought your formula made a lot of sense. However, when I did some tests in a spreadsheet, it tends to underestimate the standard deviation. Most of the time it's 60-80% of the "true" stdev, but occasionally it's only 20%. I'm really puzzled by why this is. Could you help me please? Is there a way to reduce this error?

Here's what I did in the spreadsheet: I let it generate 1000 random numbers ranging from 0 to 400, then arbitrarily divided these 1000 numbers into 8 groups, each with a different sample size ranging from 40 to 300. Then I estimated the stdev and compared it with the true stdev of these 1000 samples. I know I used random numbers, which violates your assumption that all samples come from the same distribution with the same mean and the same "true" standard deviation. I used this because the samples could actually have different means and "true" standard deviations. Is there a way I can adjust this?
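The same experiment is easy to redo in Python (a sketch; the group sizes below are assumed, since the exact split isn't given):

```python
import math
import random
import statistics

random.seed(1)  # make the run reproducible
data = [random.uniform(0, 400) for _ in range(1000)]

# arbitrarily split the 1000 numbers into 8 groups of unequal size
sizes = [40, 60, 80, 100, 120, 150, 200, 250]
groups, start = [], 0
for n in sizes:
    groups.append(data[start:start + n])
    start += n

means = [statistics.fmean(g) for g in groups]
total_mean = sum(n * m for n, m in zip(sizes, means)) / sum(sizes)

# heuristic estimate from the earlier post
est = sum(math.sqrt(n) * abs(m - total_mean)
          for n, m in zip(sizes, means)) / math.sqrt(sum(sizes))

true_sd = statistics.pstdev(data)
print(f"estimate: {est:.1f}, true stdev: {true_sd:.1f}")
```

For a uniform(0, 400) population the true standard deviation is 400/√12 ≈ 115.5, which gives a known target to compare against.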

Maybe I should do a test that meets your assumptions.

Sorry to bother you again and thanks very much.

Rich

Yeah, I'm not sure where the problem is. I tested it with 1000*5 numbers (grouped as 3+2) in a spreadsheet and it worked; then I ran more tests with Python and it did not, even with the same groups - with sufficient data it underestimated the deviation by ~25% on average for a large class of different group numbers and sizes, but not for all.
Probably needs a detailed calculation to find the formula for the best estimate.

RichS said:
I used this because the samples could actually have different means and "true" standard deviations.
If the different subsamples can have that, you are lost - there is no way to draw conclusions then. But if your 1000 random numbers were all drawn from the same distribution, that does not happen.

Thanks again, mfb, for your swift reply and your own testing. Does this provide any hint on where the problem is?
https://www.physicsforums.com/threa...iation-over-samples-of-different-size.268377/

Below is a clip from this page. Maybe ##n_1 S_1^2## is what's missing?

I think I found the answer. It's not what I said. I'll post it after delivering my results, which are due in a few hours' time.

Too bad - I thought I solved the problem, but it's actually getting worse. In another little experiment I did, the estimated stdev is 300% of the 'true' stdev. I'm still struggling to understand this. Does anyone have any ideas, please?

RichS said:
Thanks again, mfb, for your swift reply and your own testing. Does this provide any hint on where the problem is?
https://www.physicsforums.com/threa...iation-over-samples-of-different-size.268377/
That is a test of whether different datasets are compatible.

The problem is interesting enough for the long way:
Let N be the total sample size (sum of all subsets). There are I subsets, where I>1 to make the problem meaningful. All sums and products always run over those subsets.
Let Ni be the size of subset i, let Ai be the observed average in this subset. Let A be the total observed average, ##A=\frac{1}{N} \sum N_i A_i##.
Assume that every data point in the sample follows a Gaussian distribution with (true) mean m and standard deviation ##\sigma##.
The distribution of Ai will then follow a Gaussian with mean m and standard deviation ##\frac{\sigma}{\sqrt{N_i}}##.

The total likelihood to observe the set {Ai} is
$$LH=\prod \frac{\sqrt{N_i}}{\sqrt{2\pi}\sigma} \exp\left( \frac{-N_i (A_i-m)^2}{2\sigma^2} \right)$$
The best estimate for ##\sigma## and m is a set that maximizes this likelihood.
Let's calculate -LLH=-log(LH) because this is easier to analyze:
$$-LLH=c+\sum \left( \frac{N_i(A_i-m)^2}{2\sigma^2} + \log(\sigma)\right)$$
Where c is some constant coming from the constant prefactors. Simplify:
$$-LLH=c+I \log(\sigma) + \frac{1}{2\sigma^2} \sum N_i(A_i-m)^2$$
Calculate the derivative with respect to m:
$$\frac{d (-LLH)}{dm} = \frac{-1}{2\sigma^2}\sum 2N_i (A_i-m)$$
Setting it to zero we get ##\sum N_i (A_i-m)=0## or ##\sum N_i A_i = Nm## which is satisfied for m=A. Not surprising: using the observed average is the best estimate for the true average. We can plug it into the -LLH and calculate the derivative with respect to ##\sigma##:
$$\frac{d (-LLH)}{d\sigma} = \frac{I}{\sigma} - \frac{1}{\sigma^3} \sum N_i(A_i-m)^2$$
Set it to zero again and after simplification we get:
$$\hat \sigma = \sqrt{\frac{1}{I} \sum N_i(A_i-m)^2}$$
This should be the best estimate (denoted by the hat) for the standard deviation in your original sample, given the averages in the subsamples and their sizes.

It is also possible to evaluate the second derivative of the log likelihood at that point to get an estimate on the uncertainty of this value. I get
$$\Delta \hat \sigma = \frac{\hat \sigma}{\sqrt{2 I}}$$
up to some prefactor of 2 or similar that might be missing.
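The two closed-form results above, ##\hat\sigma## and ##\Delta \hat\sigma##, translate into a few lines (a sketch following the derivation; every data point is assumed i.i.d. Gaussian, and the function name is mine):

```python
import math

def mle_sigma(sizes, means):
    """Maximum-likelihood estimate of sigma from subsample means alone,
    plus its rough uncertainty, following the derivation above."""
    N = sum(sizes)                                    # total sample size
    I = len(sizes)                                    # number of subsets
    A = sum(n * a for n, a in zip(sizes, means)) / N  # best estimate of m
    sigma_hat = math.sqrt(
        sum(n * (a - A) ** 2 for n, a in zip(sizes, means)) / I)
    return sigma_hat, sigma_hat / math.sqrt(2 * I)
```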

Thanks very much, mfb. You've been very helpful.

I did some more testing on your formula. Sometimes it's good, but sometimes it still produces a relatively large difference from the 'true' standard deviation. I think that's unavoidable, as with many other estimates.

I also did some tests using the smallest sample size in the standard error formula and it also seems reasonable, i.e. σ = (σ of the means) * √min(n). It's interesting that min(n) gives a better answer than avg(n).

Thanks again for your generous help!

Rich

RichS said:
Thank you both for your swift responses. Very much appreciated. I probably didn't explain clearly. Here's a made-up example of my data (sorry I don't have the real one with me right now):

Sample Means (μ): 263, 343, 445, 655, 233, 324
Sample Size (n) : 34, 5, 76, 23, 43, 45

There's no other information, i.e. all I get is this. It's all aggregate information, definitely no individual data. Because of privacy concerns, the data provider will never give me the individual data (mfb, this is what I meant by "(but not individual observations)").

Because sample size is different each time, I can't apply the formula: StDev of population = StDev of means (standard error) * sqrt(sample size)

Hi Mathman, thanks for your formula but I don't even get standard deviation for each sample so is there other solution?

Basically I need to estimate the sample size required for confidence intervals but can't find a formula. I realized that Satterthwaite Approximation won't help me because it requires Stdev of each sample.

Thanks again!
Why don't you just do a weighted average with that formula:
Sum[SE_i * sqrt(n_i / total sample size)], i = 1...5
where SE_i is the standard error (StDev of means) of sample i and n_i its sample size.

If the samples are comparable this should work.
It is also what I would put down if this was my homework problem.

RichS said:
I did some more testing on your formula. Sometimes it's good but sometimes it still produces a relatively large difference to the 'true' standard deviation. I think it's unavoidable as many other estimates.
Well, you can be unlucky. No estimate can avoid that.

RichS said:
I also did some tests using the smallest sample size in the standard error formula and it also seems reasonable, i.e. σ = (σ of the means) * √min(n). It's interesting that min(n) gives a better answer than avg(n).
That certainly breaks down if min(n) is small compared to the sizes of the other samples, and I don't see why it should be better anywhere.

@Josh: That does not work.

mfb said:
Well, you can be unlucky. No estimate can avoid that.

That certainly breaks down if min(n) is small compared to the sizes of the other samples, and I don't see why it should be better anywhere.

@Josh: That does not work.
Are all the samples the same experiment?

Idk what level you guys are on, but you can't solve this problem - there is no way to know the true covariance.

I hope the different subsamples are not correlated. If they are: yes, then we are lost without information how.

mfb said:
I hope the different subsamples are not correlated. If they are: yes, then we are lost without information how.
Yea idk, you've got to tell me what the samples are. I think if you're experimenting you do independent experiments, so independent distributions across variables.

I thought more samples reduce standard error; why do your formulas say different?

mfb, is your formula some kind of transform? How did you do that, if you don't mind explaining?

Josh S Thompson said:
I thought more samples reduce standard error why your formulas say different?
It does not do that.
More samples (larger I) reduce the uncertainty of the estimate.
If you refer to the sample size: larger samples reduce the spread of the sample means, but we want to use this spread to estimate the deviation of individual data points, which is larger than the spread of the means - the larger the sample the more significant this difference becomes.
Josh S Thompson said:
mfb, is your formula some kind of transform, how did you do that if you don't mind explaining.
Which part is unclear? It is a maximum likelihood estimation.

mfb said:
I hope the different subsamples are not correlated. If they are: yes, then we are lost without information how.
In fact, I don't know for sure if these sub-samples are correlated or not. Let me explain a little more.

Basically I'm looking at vehicle travel time between points A and B. These are collected by recording MAC addresses from people's Bluetooth devices (mostly mobiles). Because of government restrictions, I need approval to access the raw data, which I didn't have time for. Therefore, they provided me with averages for every 5 minutes, together with the corresponding sample sizes. Obviously, the sample sizes vary all the time, depending on how many drivers have their Bluetooth turned on.

What I need to figure out is how traffic varies within the peak periods (7-9am, 4-6pm) so I need to estimate standard deviation.

Intuitively, you'd think these sub-samples are correlated, because surely the conditions at this moment will somehow affect the next moment. However, that may or may not be the case. I vaguely remember papers saying that there's no correlation between traffic conditions on neighbouring road sections, and we had some small datasets that seemed to confirm this. That's not exactly the same thing, but it might be a warning against assuming correlation between time periods.

I'll try to get the raw data and compare with our estimates and see how close they got. I can come back and post the results if you're interested. :-)

Rich

5-min averages over what?
What exactly do you get and how is that related to A and B?

The samples for subsequent 5-minute steps are certainly correlated - traffic changes systematically as function of time, and traffic jams add some nonlinearity.

mfb said:
5-min averages over what?
What exactly do you get and how is that related to A and B?

The samples for subsequent 5-minute steps are certainly correlated - traffic changes systematically as function of time, and traffic jams add some nonlinearity.
The system calculates the average time it takes vehicles to travel between two locations (point A to point B), and the calculation is done over 5-minute intervals. I hope that explains it better. :-)

Hmm, so we have deviations from systematic trends (time of day) and statistical fluctuations (what is interesting). You could fit some smooth function to the observed values, and take the difference between fit and actual value as estimate for the statistical fluctuation. To improve the method, exclude the point considered from the fit, then the estimate is unbiased.
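A sketch of that leave-one-out idea, using a polynomial as the (assumed) smooth function for the time-of-day trend; the function name and degree are illustrative:

```python
import numpy as np

def loo_residuals(t, y, deg=3):
    """For each point, fit a degree-`deg` polynomial to all the OTHER
    points and take the difference between the fit and the held-out
    value.  The spread of these residuals estimates the statistical
    fluctuation around the systematic trend, without bias from the
    point itself being included in the fit."""
    res = []
    for i in range(len(t)):
        mask = np.arange(len(t)) != i
        coef = np.polyfit(t[mask], y[mask], deg)
        res.append(y[i] - np.polyval(coef, t[i]))
    return np.array(res)
```

On noiseless data that follows a low-degree trend exactly, the residuals come out (numerically) zero, which is a quick sanity check for the method.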

Another interesting point: depending on how the averages are done, the sample size itself probably depends on the traffic speed.

Great suggestions, thanks!

I only know that the sample size is about 30% of the population, which is higher than I expected - I was surprised that so many people actually have their Bluetooth turned on. However, I don't know more details on how the averages are calculated since I don't have the raw data. My email requesting the raw data seems to have fallen into a black hole, which is not good news.

Cheers,
Rich

1. How do you calculate Standard Error for unequal sample sizes?

The formula for the Standard Error of the difference between two sample means with unequal sample sizes is:

SE = √(σ₁²/n₁ + σ₂²/n₂)

Here, σ₁ and σ₂ represent the standard deviations of the two samples, and n₁ and n₂ represent the sample sizes. The formula takes the unequal sizes of the samples into account, providing a more accurate representation of the variability involved.
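A minimal sketch of this two-sample formula (the function name is illustrative):

```python
import math

def standard_error(sd1, n1, sd2, n2):
    """Standard error of the difference between two sample means
    with unequal sample sizes: sqrt(sd1^2/n1 + sd2^2/n2)."""
    return math.sqrt(sd1**2 / n1 + sd2**2 / n2)
```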

2. Why is it important to calculate Standard Error for unequal sample sizes?

Calculating Standard Error for unequal sample sizes is important because it allows for a more precise estimate of the population mean. When using unequal sample sizes, the variability within the population is not evenly represented in each sample. By taking into account the different sample sizes, the Standard Error can provide a more accurate measure of the variability within the population.

3. Can Standard Error be calculated for more than two sample sizes?

Yes, the formula for calculating Standard Error can be extended to include more than two sample sizes. The formula becomes:

SE = √(σ₁²/n₁ + σ₂²/n₂ + ... + σₖ²/nₖ)

Here, σₖ represents the standard deviation of the k-th sample, and nₖ represents the size of the k-th sample.

4. Can Standard Error be negative?

No, Standard Error cannot be negative. It is a measure of the variability within a population and is always a positive value. If the calculated value for Standard Error is negative, it is likely due to an error in the calculation.

5. How is Standard Error related to Standard Deviation?

Standard Error and Standard Deviation are both measures of variability. However, Standard Deviation measures the spread of individual observations, while Standard Error measures the uncertainty of the sample mean as an estimate of the population mean. Standard Error is calculated by dividing the Standard Deviation by the square root of the sample size, so it shrinks as the sample size increases.
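As a quick sketch of that relationship (the sample values are made up):

```python
import math
import statistics

sample = [12.0, 15.0, 11.0, 14.0, 13.0]   # hypothetical observations
sd = statistics.stdev(sample)             # sample standard deviation (ddof=1)
se = sd / math.sqrt(len(sample))          # standard error of the mean
```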
