Confidence interval for the weighted mean

In summary, the algorithm produces normally distributed random numbers with unknown mean and variance. I need to know the mean of those numbers. I calculated the weighted mean that minimizes the variance of the 8 means. The confidence interval for the weighted mean can be calculated using the t quantile.
  • #1
Cristiano
I run an algorithm that outputs a normally distributed random number with unknown mean and variance. I need to know the mean of those numbers.

To do that, I repeatedly run the algorithm to obtain many numbers, then I calculate the mean and the confidence interval for the mean.

Over the last few years, I ran 8 of those simulations with different parameters (the population mean doesn't change).
For each simulation I have: the number of random numbers, their mean and the confidence interval for the mean.

Since I need only a single value for the mean, I calculated the weighted mean that minimizes the variance of the 8 means.
Is there any way to calculate the confidence interval for the weighted mean?

Thank you
 
  • #2
I think that, if instead of minimising the variance you minimised the weighted variance, where each simulation mean is weighted by the number of random numbers taken from that simulation, the problem becomes the same as an ordinary least squares regression with no explanatory variables. The confidence interval for the mean is then the confidence interval for the intercept of the regression, which most statistical packages will give you.
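For illustration, a minimal R sketch of that idea; the vectors m (simulation means) and n (sample sizes) below are placeholders, not actual data from the thread:

# Intercept-only weighted regression: each mean weighted by its sample size.
# m and n are placeholder values standing in for the eight simulations.
m <- c(0.253, 0.257, 0.252, 0.255, 0.254, 0.256, 0.253, 0.254)
n <- c(58, 77, 195, 100, 120, 80, 150, 90)

fit <- lm(m ~ 1, weights = n)

coef(fit)      # the weighted mean (the intercept)
confint(fit)   # confidence interval for the intercept, i.e. for the mean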
 
  • #3
I should write the code for the calculations and I'd be grateful if you could tell me how to do that.
 
  • #4
Actually, forget the regression. I've remembered an easier way using a t statistic: use the last formula in the 'Theoretical Example' section of the wiki article.

Make all your random numbers into a vector of length ##n## and ignore which of the eight simulations they came from (which is irrelevant).

Then calculate the mean ##\bar{x}## and standard deviation ##s## of that vector.

The only remaining value you need for the formula is the t quantile ##c##. If you want a ##100p##% confidence interval (e.g. ##p=0.95##) then ##c## is the ##\frac{1+p}{2}## quantile of the t distribution with ##n-1## degrees of freedom. Most mathematical and software packages have built-in functions that calculate quantiles of t distributions. Or, since it doesn't need to be recalculated each time, you can just look it up.
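A minimal R sketch of that recipe, assuming all the numbers have been pooled into a single vector x (the rnorm call is only a placeholder for real data):

# Pooled t interval for the mean of x.
x <- rnorm(330, mean = 0.25, sd = 0.013)   # placeholder data

n    <- length(x)
xbar <- mean(x)
s    <- sd(x)

p  <- 0.95                          # desired confidence level
tq <- qt((1 + p) / 2, df = n - 1)   # the t quantile c from the formula

xbar + c(-1, 1) * tq * s / sqrt(n)  # the confidence interval for the mean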
 
  • #5
andrewkirk said:
Make all your random numbers into a vector of length ##n## and ignore which of the eight simulations they came from (which is irrelevant).

I think that it's not irrelevant, because each simulation gives a different variance.
I'm currently using that method to calculate the confidence interval for a single simulation, but are you really sure that I can merge all the simulations without any corrective factor?
 
  • #6
Cristiano said:
I think that it's not irrelevant, because each simulation gives a different variance.
Do you mean that the population variance changes between simulations, or only that the sample variance changes? If the latter then you certainly can combine the simulations, as they all come from the same population. If the former then the above test will not apply. More detailed specification will be needed about how and why the population variance changes while the population mean remains unchanged, and a test could be designed based on that additional information.
 
  • #7
andrewkirk said:
Do you mean that the population variance changes between simulations, or only that the sample variance changes? If the latter then you certainly can combine the simulations, as they all come from the same population. If the former then the above test will not apply.

In theory, both the population mean and variance don't change between simulations, but to be honest, I'm not really sure. While the mean is certainly the same, the variance seems to be affected by the internal parameters of the Monte Carlo simulation: in the subsequent simulations that I use to validate the main simulation, I get good results when I don't combine the simulations, while when I combine them I get a strange shape for the distribution of the samples (I check it with a kernel density estimate, skewness and kurtosis).

andrewkirk said:
More detailed specification will be needed about how and why the population variance changes while the population mean remains unchanged, and a test could be designed based on that additional information.

I'm running a Monte Carlo simulation to obtain the critical values for many quantiles of some distributions used in nonparametric testing of the null hypothesis (like K-S and A-D).
Since the distribution of the K-S statistic is known exactly, I use it to validate all the other simulations. In this thread I'm talking about the quantiles of the K-S distribution.

Suppose that I have the following simulations (##n_i## is the number of samples used for the ith simulation):
n1= 58, mean1= 0.25375918, var1= 0.00017443
n2= 77, mean2= 0.25665387, var2= 0.00016731
n3= 195, mean3= 0.25282767, var3= 0.00006521

If I simply take all the samples as a single simulation, I get a confidence interval of 0.000951386, both the skewness and kurtosis are a bit high, and the shape of the distribution seems multimodal.

If I combine the 3 simulations using the weights that give the smallest possible variance for the weighted mean, I get a confidence interval of 0.000845679 and the weighted mean is the best possible estimate of the true mean; both the skewness and kurtosis are good and the shape shows a good normal distribution.
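For reference, a hedged R sketch of inverse-variance weighting with the three values above, assuming the listed var_k are the variances of the simulation means (if they are sample variances, divide each by n_k first):

m <- c(0.25375918, 0.25665387, 0.25282767)
v <- c(0.00017443, 0.00016731, 0.00006521)   # assumed variances of the means

w     <- 1 / v
wmean <- sum(w * m) / sum(w)   # weighted mean with minimum variance
wse   <- sqrt(1 / sum(w))      # its standard error

wmean + c(-1, 1) * qnorm(0.975) * wse   # approximate 95% interval (normal)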
 
  • #8
If for different simulations the means are the same but unknown and the variances are different, then we have no choice but to treat each simulation as a separate population, which rules out tactics such as trying to calculate a weighted variance.
To get a confidence interval for the unknown mean, we could consider a collection of hypothesis tests, with the null hypothesis in each case being that the true mean is equal to the upper limit of the confidence interval. The probability that the t-statistics from the different sims are all at least as far from zero as observed (i.e. that each sample mean is at least that number of standard deviations from the true mean) can be set to the desired p-value, and the upper limit that solves that equation determined.
Using that approach, an equation for the upper limit ##u## of a ##1-\alpha## confidence interval might be along the lines of:

$$\prod_{k=1}^N t_{n_k-1}\left(\frac{m_k-u}{s_k}\right)=\frac{\alpha}{2}$$
where ##m_k,s_k,n_k## are the sample mean, sample standard deviation and number of observations from simulation ##k##, ##N## is the number of sims and ##t_r:\mathbb{R}\to\mathbb{R}## is the CDF of the t-distribution with ##r## degrees of freedom.
I suspect the equation has no analytic solution, but it could be easily solved numerically.
The corresponding equation for the lower limit ##l## might be:
$$\prod_{k=1}^N \left[1-t_{n_k-1}\left(\frac{m_k-l}{s_k}\right)\right]=\frac{\alpha}{2}$$
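A hedged R sketch of one way to solve the two equations numerically with uniroot, using the three simulations listed above as inputs; the bracketing intervals are assumptions chosen to straddle the roots:

# Numerical solution of the upper- and lower-limit equations above.
m <- c(0.25375918, 0.25665387, 0.25282767)          # sample means
s <- sqrt(c(0.00017443, 0.00016731, 0.00006521))    # sample standard deviations
n <- c(58, 77, 195)                                 # sample sizes
alpha <- 0.05

f_upper <- function(u) prod(pt((m - u) / s, df = n - 1)) - alpha / 2
f_lower <- function(l) prod(1 - pt((m - l) / s, df = n - 1)) - alpha / 2

u <- uniroot(f_upper, interval = c(max(m), max(m) + 1))$root
l <- uniroot(f_lower, interval = c(min(m) - 1, min(m)))$root
c(l, u)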
 
  • #9
I don't understand ##t_r:\mathbb{R}\to\mathbb{R}##, but can you confirm that I should calculate the CDF of the t distribution for ##\alpha/2## with ##n_k-1## degrees of freedom?
If that is correct and ##\frac{u-m_k}{s_k}## can be used instead of ##\frac{m_k-u}{s_k}##, I get a very big (and wrong) confidence interval. For example, for a 90% CI, I get u - true_mean = 0.0771484, while when I use the normal distribution I get u - true_mean = 0.00760976, which is not perfect, but reasonably good.

By the way, thank you very much for your help and your patience. :-)
 
  • #10
@Cristiano No, we can't do the swap shown in your post. I would expect a big and wrong answer if the first is used instead of the second.
It's possible the whole formula is wrong, as I just dashed it off without much checking, but it's more likely to be right if we use the t stat I suggested. When the equation is solved it should give negative t stats.
With the figures you provided, I get ##l## and ##u## to be approximately 0.2483 and 0.2598 for a 95% two-sided confidence interval.
Here is some R code that calculates it:

# sample sizes, means and variances from the three simulations
n1= 58
m1= 0.25375918
var1= 0.00017443
n2= 77
m2= 0.25665387
var2= 0.00016731
n3= 195
m3= 0.25282767
var3= 0.00006521

v=c(var1,var2,var3)
m=c(m1,m2,m3)
n=c(n1,n2,n3)

# upper limit: the product of the t CDFs should equal alpha/2 = 0.025
u=.2598
y=((m-u)/sqrt(v))
z=pt(y,n-1)
prod(z)

# lower limit: the product of the upper-tail probabilities should equal alpha/2 = 0.025
l=.2483
y=((m-l)/sqrt(v))
z=pt(y,n-1)   # n-1 degrees of freedom, matching the formula above
prod(1-z)
 
  • #11
Unfortunately, I don't know how to calculate negative t stats. Moreover, if I use the normal distribution to find the CI, the result is reasonably good (if the simulation parameters are properly chosen).

Thank you very much.
 
  • #12
Cristiano said:
Unfortunately, I don't know how to calculate negative t stats.
Are you referring to the notation ##t_r{}^{-1}##? If so, set your mind at rest: the -1 exponent is a mistake and shouldn't be there. I have corrected it above. The calcs in my later post are correct though.
BTW, for future reference, ##t_r{}^{-1}## just means the inverse function of the CDF ##t_r##.
It is calculated by the function qt(cum_probability, degrees_of_freedom) in R and
T.INV(cum_probability, degrees_of_freedom) in MS Excel. The CDF is calculated by
pt(statistic, degrees_of_freedom) in R and
T.DIST(statistic, degrees_of_freedom, TRUE) in MS Excel, and its first argument can be positive or negative.
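A quick illustration of the qt/pt pair in R (57 degrees of freedom is just an arbitrary example value):

tq <- qt(0.975, df = 57)   # 97.5% quantile of the t distribution
pt(tq, df = 57)            # returns 0.975: pt is the inverse of qt
pt(-tq, df = 57)           # negative arguments are fine: returns 0.025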

You are right though that with more than 40 degrees of freedom in each simulation, the Normal Dist will be a good enough approximation to the t-dist.
 
  • #13
Now I got the point! :-)
##(m - u)/s## is not multiplied by t; it is the argument of the t CDF.
Now I get l= 0.248240 and u= 0.259861.

All that said, the CI obtained with that method is too big; it's a 100% CI! :-) For example, I get u - l = 0.003, while with the method I currently use (with the normal distribution) I get 0.00043, which is in good agreement with the CI calculated via a simulation with known statistics.
 
  • #14
Cristiano said:
For example, I get u - l = 0.003, while with the method I currently use (with the normal distribution) I get 0.00043, which is in good agreement with the CI calculated via a simulation with known statistics.
How confident are you of the validity of the method you used to get that confidence interval? It places at least two of the three sample means outside the confidence interval which, given the large number of observations in each sample, should be a cause for concern.
 
  • #15
Cristiano said:
I run an algorithm that outputs a normally distributed random number with unknown mean and variance. I need to know the mean of those numbers.

To do that, I repeatedly run the algorithm to obtain many numbers, then I calculate the mean and the confidence interval for the mean.

Over the last few years, I ran 8 of those simulations with different parameters (the population mean doesn't change).
For each simulation I have: the number of random numbers, their mean and the confidence interval for the mean.

Since I need only a single value for the mean, I calculated the weighted mean that minimizes the variance of the 8 means.
Is there any way to calculate the confidence interval for the weighted mean?

Thank you
Sorry, kind of confused, maybe this is a dumb question: can a randomly-generated sequence be normally-distributed, or do you mean you draw numbers randomly from a normal population?
 
  • #16
andrewkirk said:
How confident are you of the validity of the method you used to get that confidence interval? It places at least two of the three sample means outside the confidence interval which, given the large number of observations in each sample, should be a cause for concern.

I'd say that I'm very confident, because I simply calculate many weighted means and the related CI for one critical value of the well-known K-S statistic (which means that I know the true critical value exactly), then I count how many weighted means lie outside the CI.
I also did a much simpler (and faster) simulation in which the "complicated" K-S statistic is replaced by a normally distributed random number.
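A hedged R sketch of that kind of coverage check, mirroring the simpler normal-variable version; the mean, standard deviation and sample size below are placeholder values:

true_mean <- 0.2528   # the "known" value used in the check (placeholder)
reps   <- 10000
misses <- 0
for (i in 1:reps) {
  x  <- rnorm(100, mean = true_mean, sd = 0.013)   # placeholder sample
  ci <- mean(x) + c(-1, 1) * qnorm(0.975) * sd(x) / sqrt(length(x))
  if (true_mean < ci[1] || true_mean > ci[2]) misses <- misses + 1
}
misses / reps   # should be close to 0.05 for a 95% CI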
 
  • #17
WWGD said:
Sorry, kind of confused, maybe this is a dumb question: can a randomly-generated sequence be normally-distributed, or do you mean you draw numbers randomly from a normal population?

I draw uniformly distributed random numbers in [0,1).
Then I use those numbers to calculate the critical value of a particular quantile of the K-S statistic.
When many critical values are calculated (for the same quantile), those critical values are normally distributed with known mean but unknown variance (or, at least, I don't know the variance, but I can calculate the exact expected value).

I don't have any proof for the distribution of the critical values for the same quantile (I suppose that the central limit theorem kicks in), but I check that distribution using the K-S test (for normality), an approximate A-D test, skewness, kurtosis and a kernel density estimate.
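For concreteness, a hedged R sketch of those checks, assuming the simulated critical values are in a vector cv (generated here as placeholder data):

cv <- rnorm(500, mean = 0.2528, sd = 0.009)   # placeholder critical values

# K-S test against a normal with parameters estimated from the data
# (only approximate, since estimating the parameters biases the test)
ks.test(cv, "pnorm", mean(cv), sd(cv))

# Sample skewness and excess kurtosis, computed directly
z <- (cv - mean(cv)) / sd(cv)
mean(z^3)       # skewness: near 0 for a normal sample
mean(z^4) - 3   # excess kurtosis: near 0 for a normal sample

plot(density(cv))   # kernel density estimate to inspect the shape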
 
  • #18
Cristiano said:
I simply calculate many weighted means and the related CI for one critical value of the well-known K-S statistic
What is the null hypothesis for which you are calculating critical values?
I have a vague feeling that somewhere in there you might be assuming that all generated numbers come from the same population, and it's hard to reconcile that with the uncertainty expressed in post 7. But without knowing the details of the calc you are doing, an outsider can't form a judgement on that.
 
  • #19
andrewkirk said:
What is the null hypothesis for which you are calculating critical values?

The null hypothesis is that the input numbers are uniformly distributed.

andrewkirk said:
I have a vague feeling that somewhere in there you might be assuming that all generated numbers come from the same population, and it's hard to reconcile that with the uncertainty expressed in post 7. But without knowing the details of the calc you are doing, an outsider can't form a judgement on that.

I'm aware that my English is not good, but I think I gave all the details in this thread, and I'm happy with the method I'm currently using to calculate the CI. The problem was that a parameter of the simulation was not properly chosen; as a consequence, the distribution of the critical values was heavily skewed. Now I generate many values, the distribution is almost perfectly normal, and hence I can use the normal distribution to easily calculate the CI for the mean of the critical values.
 

FAQ: Confidence interval for the weighted mean

1. What is a confidence interval for the weighted mean?

A confidence interval for the weighted mean is a range of values that is likely to contain the true population mean with a certain level of confidence. It takes into account the weight of each data point in the calculation of the mean, giving a more accurate estimate of the true mean.

2. How is a confidence interval for the weighted mean calculated?

The calculation of a confidence interval for the weighted mean uses the sample data, the standard error of the weighted mean, and the desired level of confidence, typically 95%. For inverse-variance weights the standard error is ##\sqrt{1/\sum_k 1/\sigma_k^2}##, and the interval is the weighted mean plus or minus a normal (or t) quantile times this standard error.

3. Why is a confidence interval for the weighted mean important?

A confidence interval for the weighted mean is important because it gives us a range of values that is likely to contain the true population mean. This allows us to make more accurate inferences about the population based on our sample data. It also takes into account the varying weights of each data point, which can affect the accuracy of the mean.

4. Can a confidence interval for the weighted mean be used for any type of data?

Yes, a confidence interval for the weighted mean can be used for any type of numerical data; the standard formulas additionally assume that the data, or at least the weighted mean itself (via the central limit theorem), is approximately normally distributed. It is a commonly used tool in statistics and can be applied to a variety of situations, such as estimating the average weight of a population or the average salary of a group of employees.

5. How can I interpret a confidence interval for the weighted mean?

A confidence interval for the weighted mean can be interpreted as follows: if we were to take many samples from the same population and calculate a 95% confidence interval for the weighted mean from each sample, about 95% of those intervals would contain the true population mean. Strictly speaking, the 95% describes this long-run coverage of the procedure rather than the probability that any single computed interval contains the true mean.
