# Confidence interval for the weighted mean

1. Dec 11, 2015

### Cristiano

I run an algorithm that outputs a normally distributed random number with unknown mean and variance. I need to know the mean of those numbers.

To do that, I repeatedly run the algorithm to obtain many numbers, then I calculate their mean and the confidence interval for the mean.

Over the last few years, I have run 8 of those simulations with different parameters (the population mean doesn't change).
For each simulation I have: the number of random numbers, their mean, and the confidence interval for the mean.

Since I need only a single value for the mean, I calculated the weighted mean that minimizes the variance of the combination of the 8 means.
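To be explicit, this is the standard minimum-variance combination with inverse-variance weights:

$$\hat{m}=\frac{\sum_{k=1}^{8} w_k\,\bar{x}_k}{\sum_{k=1}^{8} w_k},\qquad w_k=\frac{1}{\operatorname{Var}(\bar{x}_k)},\qquad \operatorname{Var}(\hat{m})=\frac{1}{\sum_{k=1}^{8} w_k}$$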
Is there any way to calculate the confidence interval for the weighted mean?

Thank you

2. Dec 11, 2015

### andrewkirk

I think that if, instead of minimising the variance, you minimised the weighted variance, with each simulation mean weighted by the number of random numbers taken from that simulation, the problem would reduce to an ordinary least squares regression with no explanatory variables. The confidence interval for the mean would then be the confidence interval for the regression intercept, which most statistical packages will give you.
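As a sketch of that idea in R (the means and counts below are placeholder numbers, not from any real simulation), an intercept-only weighted regression would be:

```r
# Hypothetical per-simulation summaries
m <- c(0.2538, 0.2567, 0.2528)   # the simulation means
n <- c(58, 77, 195)              # numbers of random numbers, used as weights

# Regression with no explanatory variables: the intercept is the weighted mean
fit <- lm(m ~ 1, weights = n)
coef(fit)[[1]]                   # equals weighted.mean(m, n)
confint(fit, level = 0.95)       # confidence interval for the intercept
```

Note that this treats each simulation mean as a single observation, so the interval is based on only $N-1$ degrees of freedom.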

3. Dec 11, 2015

### Cristiano

I should write the code for the calculations and I'd be grateful if you could tell me how to do that.

4. Dec 11, 2015

### andrewkirk

Actually, forget the regression. I've remembered an easier way using a t statistic. Use the last formula in the 'Theoretical Example' section of the wiki article.

Make all your random numbers into a vector of length $n$ and ignore which of the eight simulations they came from (which is irrelevant).

Then calculate the mean $\bar{x}$ and standard deviation $s$ of that vector.

The only remaining value you need for the formula is the t quantile $c$. If you want a confidence level of $p$ (e.g. $p=0.95$), then $c$ is the $\frac{1+p}{2}$ quantile of the t distribution with $n-1$ degrees of freedom. Most mathematical and statistical software packages have built-in functions that calculate quantiles of t distributions. Or, since it doesn't need to be recalculated each time, you can just look it up.
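In R, the whole recipe might look like this (the vector `x` is a simulated stand-in for the pooled output of the eight runs, not real data):

```r
set.seed(1)
x <- rnorm(330, mean = 0.25, sd = 0.013)  # stand-in for the pooled simulation output
n <- length(x)
xbar <- mean(x)                           # sample mean
s <- sd(x)                                # sample standard deviation
cq <- qt(0.975, df = n - 1)               # t quantile for a 95% two-sided CI
ci <- xbar + c(-1, 1) * cq * s / sqrt(n)  # confidence interval for the mean
```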

5. Dec 12, 2015

### Cristiano

I think that it's not irrelevant, because each simulation gives a different variance.
I'm currently using that method to calculate the confidence interval for a single simulation, but are you really sure that I can merge all the simulations without any corrective factor?

6. Dec 12, 2015

### andrewkirk

Do you mean that the population variance changes between simulations, or only that the sample variance changes? If the latter then you certainly can combine the simulations, as they all come from the same population. If the former then the above test will not apply. More detailed specification will be needed about how and why the population variance changes while the population mean remains unchanged, and a test could be designed based on that additional information.

7. Dec 13, 2015

### Cristiano

In theory, both the population mean and variance don't change between simulations, but to be honest, I'm not really sure. While the mean is certainly the same, the variance seems to be affected by the internal parameters of the Monte Carlo simulation: in the follow-up simulations that I use to validate the main simulation, I get good results when I don't combine the simulations, while when I combine them I get a strange shape for the distribution of the samples (I check it using a kernel density estimate, skewness and kurtosis).

I'm doing a Monte Carlo simulation to obtain the critical values for many quantiles of some distributions used in non-parametric null-hypothesis testing (like Kolmogorov-Smirnov and Anderson-Darling).
Since the K-S statistic can be known exactly, I use it to validate all the other simulations. In this thread I'm talking about the quantiles of the K-S distribution.

Suppose that I have the following simulations ($n_i$ is the number of samples used for the ith simulation):
n1= 58, mean1= 0.25375918, var1= 0.00017443
n2= 77, mean2= 0.25665387, var2= 0.00016731
n3= 195, mean3= 0.25282767, var3= 0.00006521

If I simply take all the samples as a single simulation, I get a confidence interval of 0.000951386; both the skewness and kurtosis are a bit high and the shape of the distribution seems multimodal.

If I combine the simulations to get the smallest possible variance of the 3 means when calculating the weighted mean, I get a confidence interval of 0.000845679. The weighted mean is the best possible estimate of the true mean, both the skewness and kurtosis look good, and the shape is close to a normal distribution.
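For reference, the 0.000845679 figure can be reproduced in R as the 90% half-width of the inverse-variance weighted mean (assuming the quoted variances are per-observation sample variances, so each mean has variance $var_k/n_k$):

```r
# Per-simulation summaries from above
n <- c(58, 77, 195)
m <- c(0.25375918, 0.25665387, 0.25282767)
v <- c(0.00017443, 0.00016731, 0.00006521)

w    <- n / v                 # weight of each mean = 1 / Var(mean_k) = n_k / var_k
wm   <- sum(w * m) / sum(w)   # minimum-variance weighted mean
se   <- sqrt(1 / sum(w))      # standard error of the weighted mean
half <- qnorm(0.95) * se      # 90% half-width, normal approximation
```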

8. Dec 13, 2015

### andrewkirk

If for different simulations the means are the same but unknown, and the variances are different, then we have no choice but to treat each simulation as a separate population, which rules out tactics such as calculating a weighted variance.
To get a confidence interval for the unknown mean, we could consider a collection of hypothesis tests, in each case against the null hypothesis that the true mean equals the upper limit of the confidence interval. We set the probability that the t-statistics from the different sims are all at least as far from zero (i.e. that each sample mean is at least that many standard deviations from the true mean) equal to the desired p-value, and solve for the upper limit.
Using that approach, an equation for the upper limit $u$ of a $1-\alpha$ confidence interval might be along the lines of:

$$\prod_{k=1}^N t_{n_k-1}\left(\frac{m_k-u}{s_k}\right)=\frac{\alpha}{2}$$
where $m_k,s_k,n_k$ are the sample mean, sample standard deviation and number of observations from simulation $k$, $N$ is the number of sims and $t_r:\mathbb{R}\to\mathbb{R}$ is the CDF of the t-distribution with $r$ degrees of freedom.
I suspect the equation has no analytic solution, but it could be easily solved numerically.
The corresponding equation for the lower limit $l$ might be:
$$\prod_{k=1}^N \left[1-t_{n_k-1}\left(\frac{m_k-l}{s_k}\right)\right]=\frac{\alpha}{2}$$
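The two equations can be solved numerically, for instance with R's uniroot() (a sketch using the figures from post 7, with $s_k=\sqrt{var_k}$):

```r
# Per-simulation summaries from post 7
m <- c(0.25375918, 0.25665387, 0.25282767)   # sample means
v <- c(0.00017443, 0.00016731, 0.00006521)   # sample variances
s <- sqrt(v)                                  # sample standard deviations
n <- c(58, 77, 195)                           # observation counts
alpha <- 0.05                                 # for a 95% two-sided interval

# Product of lower-tail t probabilities = alpha/2 defines the upper limit u
upper_eq <- function(u) prod(pt((m - u) / s, df = n - 1)) - alpha / 2
# Product of upper-tail t probabilities = alpha/2 defines the lower limit l
lower_eq <- function(l) prod(1 - pt((m - l) / s, df = n - 1)) - alpha / 2

u <- uniroot(upper_eq, c(max(m), max(m) + 1))$root
l <- uniroot(lower_eq, c(min(m) - 1, min(m)))$root
```

With these figures the roots come out near 0.2598 and 0.2483.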

Last edited: Dec 13, 2015
9. Dec 13, 2015

### Cristiano

I don't understand the notation $t_r:\mathbb{R}\to\mathbb{R}$; can you confirm that I should calculate the CDF of the t-distribution for $\alpha/2$ with $n_k-1$ degrees of freedom?
If that is correct and $\frac{u-m_k}{s_k}$ can be used instead of $\frac{m_k-u}{s_k}$, I get a very big (and wrong) confidence interval. For example, for a 90% CI, I get u - true_mean = 0.0771484, while when I use the normal distribution I get u - true_mean = 0.00760976, which is not perfect, but reasonably good.

By the way, thank you very much for your help and your patience. :-)

10. Dec 13, 2015

### andrewkirk

@Cristiano No, we can't do the swap shown in your post: I would expect a big and wrong answer if the first expression is used instead of the second.
It's possible the whole formula is wrong, as I dashed it off without much checking, but it's more likely to be right if we use the t stat I suggested. When the equation is solved, the t stats should come out negative.
With the figures you provided, I get $l$ and $u$ to be approximately 0.2483 and 0.2598 for a 95% two-sided confidence interval.
Here is some R code that checks those values:

```r
# Per-simulation summaries
n1 = 58;  m1 = 0.25375918; var1 = 0.00017443
n2 = 77;  m2 = 0.25665387; var2 = 0.00016731
n3 = 195; m3 = 0.25282767; var3 = 0.00006521

v = c(var1, var2, var3)
m = c(m1, m2, m3)
n = c(n1, n2, n3)

# Upper limit: the product of lower-tail t probabilities should be alpha/2 = 0.025
u = .2598
y = (m - u) / sqrt(v)
z = pt(y, n - 1)
prod(z)

# Lower limit: the product of upper-tail t probabilities should also be 0.025
l = .2483
y = (m - l) / sqrt(v)
z = pt(y, n - 1)   # degrees of freedom are n - 1, matching the formula above
prod(1 - z)
```

11. Dec 13, 2015

### Cristiano

Unfortunately, I don't know how to calculate negative t stats. Moreover, if I use the normal distribution to find the CI, the result is reasonably good (if the simulation parameters are properly chosen).

Thank you very much.

12. Dec 13, 2015

### andrewkirk

Are you referring to the notation $t_r{}^{-1}$? If so, set your mind at rest: the $-1$ exponent was a mistake and shouldn't be there. I have corrected it above. The calcs in my later post are correct though.
BTW, for future reference, $t_r{}^{-1}$ just means the inverse function of the CDF $t_r$. It is calculated by qt(cum_probability, degrees_of_freedom) in R and T.INV(probability, deg_freedom) in MS Excel. The CDF itself is calculated by pt(statistic, degrees_of_freedom) in R and T.DIST(x, deg_freedom, TRUE) in MS Excel, and its first argument can be positive or negative.

You are right though that with more than 40 degrees of freedom in each simulation, the Normal Dist will be a good enough approximation to the t-dist.
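For example, with the smallest of the three sample sizes:

```r
qt(0.975, df = 57)   # t quantile with 57 degrees of freedom, about 2.00
qnorm(0.975)         # normal quantile, about 1.96
```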

13. Dec 14, 2015

### Cristiano

Now I've got the point! :-)
$(m - u) / s$ is not multiplied by $t$; it's the argument of the t-distribution CDF.
Now I get $l = 0.248240$ and $u = 0.259861$.

All that said, the CI obtained with that method is far too big; it's practically a 100% CI! :-) For example, I get $u - l = 0.003$, while with the method I currently use (with the normal distribution) I get 0.00043, which is in good agreement with the CI calculated via a simulation done with known statistics.

14. Dec 14, 2015

### andrewkirk

How confident are you of the validity of the method you used to get that confidence interval? It places at least two of the three sample means outside the confidence interval which, given the large number of observations in each sample, should be a cause for concern.

15. Dec 14, 2015

### WWGD

Sorry, kind of confused, maybe this is a dumb question: can a randomly-generated sequence be normally-distributed, or do you mean you draw numbers randomly from a normal population?

16. Dec 15, 2015

### Cristiano

I'd say that I'm very confident, because I simply calculate many weighted means and the related CIs of one critical value of the well-known K-S statistic (which means that I know the true critical value exactly), then I count how many weighted means lie outside the CI.
I also did a much simpler (and faster) simulation where the "complicated" K-S statistic is replaced by a normally distributed random number.

17. Dec 15, 2015

### Cristiano

I draw uniformly distributed random numbers in [0,1).
Then I use those numbers to calculate the critical value of a particular quantile of the K-S statistic.
When many critical values are calculated (for the same quantile), those critical values are normally distributed with known mean but unknown variance (or, at least, I don't know the variance, but I can calculate the exact expected value).

I don't have any proof of the distribution of the critical values for the same quantile (I suppose that the central limit theorem kicks in), but I check that distribution using the K-S test (for normality), an approximate A-D test, skewness, kurtosis and kernel density estimation.
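A minimal version of that normality check in R (`cv` is a simulated stand-in for the critical values; note that standardising with the estimated mean and sd makes the plain K-S test conservative):

```r
set.seed(42)
cv <- rnorm(1000, mean = 0.2534, sd = 0.0005)  # stand-in for simulated critical values
z  <- as.vector(scale(cv))                     # standardise with estimated mean and sd
res <- ks.test(z, "pnorm")                     # K-S test against the standard normal
res$statistic                                   # small D is consistent with normality
res$p.value
```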

18. Dec 15, 2015

### andrewkirk

What is the null hypothesis for which you are calculating critical values?
I have a vague feeling that somewhere in there you might be assuming that all generated numbers come from the same population, and it's hard to reconcile that with the uncertainty expressed in post 7. But without knowing the details of the calc you are doing, an outsider can't form a judgement on that.

19. Dec 15, 2015

### Cristiano

The null hypothesis is that the input numbers are uniformly distributed.

I'm aware that my English is not good, but I think that I gave all the details in this thread, and I'm happy with the method that I'm currently using to calculate the CI. The problem was that a parameter of the simulation was not properly chosen; as a consequence, the distribution of the critical values was heavily skewed. Now I generate many more values, the distribution is almost perfectly normal, and hence I can use the normal distribution to easily calculate the CI for the mean of the critical values.