# Error propagation for a sum of means

## Main Question or Discussion Point

I have a = {a1, a2, ..., a1000}, where this set forms a distribution of photoelectrons (pe) seen by a particular photomultiplier tube (PMT) over 1000 repeated events. I then have N such sets (one per PMT), each containing 1000 pe values which I believe are random and independent. So a, b, c, ... (not enough letters!), where a corresponds to PMT 1, b corresponds to PMT 2, etc.

a then goes into a histogram, from which a mean and variance are extracted; this is then done for all N sets. I then define N variables pi = μi/μT, where μi is the mean from histogram i (which in turn corresponds to a, and so on), and μT is the sum of the means μi from all N histograms.

I'm now confused about how I'd calculate V[pi], i.e. for PMT 1, given that I know E[a] and V[a]. So far I've thought that perhaps I can say μT = μ1 + μ2 + ... + μN = ((a1 + a2 + ... + a1000) + (b1 + b2 + ... + b1000) + ...)/1000, and therefore V[μT] = ((V[a1] + V[a2] + ... + V[a1000]) + (V[b1] + V[b2] + ... + V[b1000]) + ...)/1000. I'm not sure, however, that this is correct. First of all, I don't know if the logic is correct, and secondly I'm not sure whether the aj, bk, and so on can be treated as independent random variables (I think they are random and independent, but the means E[a], E[b] etc. are not?).

Hopefully this hasn't been too confusing, and any ideas would be greatly appreciated. Cheers.

Stephen Tashi
I assume "V(..)" means "variance of". However, "variance of" is an ambiguous phrase. You can "calculate" the variance of a sample. However, your goal might be to "estimate" the variance of some random variable - i.e. of some population. One estimates the variance of a random variable by doing calculations on values in a sample. Those calculations often involve computing sample variances.

If you are trying to estimate the variance of a random variable, what is the definition of that random variable? For example, is it the mean number of electrons detected in 1000 events? Or is it the mean number of electrons detected in a single event?

Are the experiments a, b, c, ... all conducted under the same conditions?

FatPhysicsBoy
Hello Stephen, thank you for your reply. I do indeed mean "variance of" when I say V[..]. However, the sets a, b, c, etc. do not correspond to distinct experiments. In fact, a single experiment (or event) results in N values, one for each PMT in the detector: a1, b1, c1, ..., NthLetter1. Event 2 then has a2, b2, c2, ..., NthLetter2, and so on for 1000 events. At the end of this I look at the distributions a, b, c, and so on. Obtaining the mean of each distribution gives me a photoelectron expectation value for PMTa, PMTb, PMTc, ..., PMTNthLetter. I then have a distribution of N expectation values, formed from the expected photoelectron count at each PMTi.

I then run another test event with identical conditions, where the sets a, b, c, ... are now single-valued; they have only one element in them: the number of photoelectrons seen by PMTi in that single event. This single photoelectron value is compared to the expected value for each PMT using a chi-square goodness-of-fit test. At the moment my chi-square statistic is (di - pi)^2 / pi, where pi is as defined in my initial post, and di comes from the single test event and is defined as (photoelectron_count_on_pmt_i)/(sum_of_photoelectron_counts_over_all_N_pmts). The chi-square statistic is then summed over all N PMTs.

The motivation behind wanting to know the variance of pi comes from wanting to explore how effective the test is when the chi-square statistic is weighted with the variance instead of with pi.

Stephen Tashi
First, I'll just attempt to rephrase the problem in different notation.

Let $Y = (Y_1,Y_2,...Y_M)$ be a vector of $M$ possibly distinct random variables representing the counts recorded by each of $M$ photo multiplier tubes (PMTs) in one experiment.

Let $y[\ ][\ ]$ be an array of observed data, where $y[\ i][j]$ gives the count on the $i$th PMT in the $j$th experiment, $i = 1,2,...,M$, $j = 1,2,...,N$. The $N$ experiments are assumed to be independent realizations of $Y$.

Let $z[\ i]$ be an array of data, $i = 1,2,...,M$, representing the observed count on each PMT in a single experiment done under possibly different conditions than the experiments mentioned above (i.e. $z[\ ]$ is not from an experiment that produced the data in $y[\ ][\ ]$).

The goal is to do a hypothesis test using the null hypothesis that $z[\ ]$ is a realization of $Y$.
The general question is how to define and compute a Chi-squared statistic for this test.

------
Let $\mu[\ i] = \frac{1}{N} \sum_{j=1}^N y[\ i] [j]$

Let $S = \sum_{i=1}^M \mu[\ i]$.

You define $p[\ i] = \frac{\mu[\ i]}{S}$.

The definition of $p[\ i ]$ "corresponds" to a random variable $P_i = Y_i/ (\ \sum_{j=1}^M Y_j \ )$.

- or perhaps it "corresponds" to $P_i = Y_i/(\sum_{j=1}^M E(Y_j) )$ ?

You want to know how to estimate the variance of $P_i$.
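To make the notation concrete, the quantities $\mu[\ i]$, $S$ and $p[\ i]$ can be computed with a minimal NumPy sketch. (The Poisson toy counts, seed, and variable names below are all mine, purely for illustration; real data would replace the simulated array.)

```python
import numpy as np

# Toy stand-in for y[i][j]: count on PMT i in experiment j, M = 4 PMTs, N = 1000 events
rng = np.random.default_rng(0)
M, N = 4, 1000
y = rng.poisson(lam=[5.0, 3.0, 2.0, 1.0], size=(N, M)).T  # shape (M, N)

mu = y.mean(axis=1)   # mu[i] = (1/N) * sum_j y[i][j]
S = mu.sum()          # S = sum_i mu[i]
p = mu / S            # p[i] = mu[i] / S, sums to 1 by construction
```

The open question is then what variance to attach to each `p[i]`, which is where the random variable $P_i$ comes in.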

FatPhysicsBoy
Hello Stephen, yes everything is correct. $p[\ i]$ "corresponds" to $P_i = Y_i/ (\ \sum_{j=1}^M Y_j \ )$, where the $Y$ used here comes from $z[\ ]$ and not $y[\ ][\ ]$.

I realised I made a mistake when interpreting the chi-square equation and that the variance I want is indeed the one you mention and not that for $p[\ i]$. You see currently I am using $\chi^{2} = \sum_{i=1}^{M}\frac{(P_{i} - p[\ i])^{2}}{p[\ i]}$ but now I want to try $\chi^{2} = \sum_{i=1}^{M}\frac{(P_{i} - p[\ i])^{2}}{\sigma_{P_{i}}^{2}}$.

The general idea here is: I have my $y[\ i][j]$ where all $j$ 'experiments' are at the same position $x$; the $p[\ i]$ then give a 'percentage hit'. E.g. if $x$ is right next to PMT20, then PMT20 will have a high value of $p$, the $p$ values for the neighbouring PMTs will be slightly lower, those for the next-nearest PMTs slightly lower still, and so on.

I then do everything again at position $x_{1} \neq x$, and get a whole new set $y_{1}[\ i][j]$. Maybe this time $x_{1}$ is right near PMT140, and then a similar argument to the above applies.

Finally, when I run the single 'experiment', I get $z[\ ]$ and subsequently my $P_{i}$. I do this at position $x$ also, so the $\chi^{2}$ for the set of $p[\ i]$ corresponding to $y[\ i][j]$ will be lower than the $\chi^{2}$ for the set of $p[\ i]$ corresponding to $y_{1}[\ i][j]$.
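The two candidate weightings can be put side by side in a toy NumPy comparison. (Every number here is invented for illustration: `p` plays the role of the expected proportions $p[\ i]$, `var_P` a made-up estimate of $\sigma_{P_i}^2$, and `P` the proportions from a single test event.)

```python
import numpy as np

p = np.array([0.45, 0.27, 0.18, 0.10])          # p[i], expected proportions from y[i][j]
var_P = np.array([0.012, 0.009, 0.007, 0.004])  # estimated Var(P_i) (made up here)
P = np.array([0.52, 0.25, 0.15, 0.08])          # P_i, proportions from the test event

chi2_p = np.sum((P - p) ** 2 / p)           # current statistic, weighted by p[i]
chi2_var = np.sum((P - p) ** 2 / var_P)     # proposed statistic, weighted by Var(P_i)
```

The two statistics differ only in the denominator, but that changes which PMTs dominate the sum.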

Stephen Tashi
The straightforward way to estimate the mean and variance of $P_i$ is to compute realizations of $P_i$ from $y[\ ][\ ]$.

Define $q[\ i][j] = \frac{ y[\ i][j] }{\sum_{k=1}^M y[\ k][j] }$
Estimate $E(P_i)$ by the sample mean of the samples $q[\ i][1], q[\ i][2], ..., q[\ i][N]$.
Estimate $Var(P_i)$ by the sample variance of that set of samples.
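In code, this estimation step might look like the following (again a toy NumPy sketch; the simulated Poisson counts stand in for the real $y[\ ][\ ]$ data, and all names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 4, 1000
y = rng.poisson(lam=[50.0, 30.0, 20.0, 10.0], size=(N, M)).T  # toy y[i][j]

# q[i][j] = y[i][j] / sum_k y[k][j]  -- proportions within each experiment
q = y / y.sum(axis=0)

E_P = q.mean(axis=1)             # sample mean over j estimates E(P_i)
Var_P = q.var(axis=1, ddof=1)    # sample variance over j estimates Var(P_i)
```

`ddof=1` gives the usual unbiased sample variance; since each column of `q` sums to 1, the `E_P` estimates sum to 1 as well.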

We have to face the question of whether Chi-square is useful in this case. There are caveats about using Chi-square when the data is proportions instead of integer counts. (e.g. http://stats.stackexchange.com/questions/104323/chi-square-analysis-percentages ).

When people say "You can use Chi-square" or "You can't use Chi-square" they refer to using the Chi-square statistic and relying on the usual numerical tables or formulas for its distribution. You can define a statistic whose formula is that of Chi-square in a situation where it does not have the distribution of the "real" Chi-square statistic. The statistic will have some distribution and that distribution may be useful in hypothesis testing.

In my opinion, the safest way to approach even moderately complicated statistical problems is to use simulation. If you have a rough model for the physics of the experiment, you can estimate the distribution of the statistic you have defined.

----

There is the question of what you are trying to accomplish by a hypothesis test. For example, the null hypothesis that "The array of data $z[\ ]$ is a realization of $Y$" is different from the more specific null hypothesis "The value $z[\ 3]$ is a realization of $Y_3$".

FatPhysicsBoy
Thank you for your help Stephen, I will compute the variances using the method you suggested. I understand; this is supposed to be just the first stage, though: see how far we can go using chi-square, and possibly replace this method with a likelihood method or some other approach.

The aim is more in line with the first null hypothesis, "The array of data $z[\ ]$ is a realisation of $Y$."

What do you mean by simulation? How would you propose achieving this using simulation/otherwise?

Thank you.

Stephen Tashi
Define $O_i = \frac{ z[\ i]}{ \sum_{k=1}^M z[k]}$

Let's say we define the statistic $s$ by $s = \sum_{i=1}^M \frac{(O_i - E(P_i))^2}{E(P_i)}$, where we have estimated $E(P_i)$ by the method in the previous post.

If the data $z[\ ]$ produces a value such as $s = 4.8$, the question we need to answer is "How probable is it that $s \ge 4.8$ when $z[\ ]$ is a realization of the same random variables that generated the $y[\ ][\ ]$ data?"

If we are intellectually confident that $s$ has a distribution for which there are statistical tables and formulae, we just consult those to find the probability that $s \ge 4.8$. In your problem, I am not intellectually confident that $s$ has any standard statistical distribution. I'm leery of the fact that the total count of events varies between different experiments and I'm leery of the fact that $O_i$ is not a count of events, but rather a proportion of counts.
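For concreteness, the statistic $s$ defined above is straightforward to compute once $E(P_i)$ has been estimated. (This is a hedged sketch: the helper name and the toy numbers are mine, not from the experiment.)

```python
import numpy as np

def statistic_s(z, E_P):
    """s = sum_i (O_i - E(P_i))^2 / E(P_i), with O_i = z[i] / sum_k z[k]."""
    z = np.asarray(z, dtype=float)
    O = z / z.sum()                       # convert raw counts to proportions
    return np.sum((O - E_P) ** 2 / E_P)

# Toy usage: made-up expected proportions and a made-up test event z[]
E_P = np.array([0.45, 0.27, 0.18, 0.10])
s = statistic_s([52, 25, 15, 8], E_P)
```

Whether the resulting number can be referred to a standard chi-square table is exactly the question at issue.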

1) The first simulation I would try is "bootstrapping". Since your $y[\ ][\ ]$ data is not large by the standards of modern computers, I would compute the value of $s$ for each experiment in the $y[\ ][\ ]$ data (i.e. for each experiment $j$, set $z[\ i] = y[\ i][j]$ and compute $s$). Histogram the values of $s$ that are produced. This gives you an estimate of the distribution of $s$. You can look at the histogram and estimate the answer to "What is the probability $s \ge 4.8$?"

2) The next simulation I would try is using a probability model to generate a histogram for $s$. For example, we can look at the total event counts in the experiments, $c_j = \sum_{i=1}^M y[\ i][j]$, and fit some integer-valued statistical distribution to that data. Next, look at how a given number of events is distributed among the detectors, and fit a distribution to that situation. Having the probability model, run many replications of it to generate simulated $z[\ ]$ data and histogram the distribution of $s$ that results.

3) I wouldn't let statistics lobotomize my knowledge of physics. You must know something about the physical aspects of the experiment. People develop simulations of physical experiments. If you can develop one, that is another method of generating simulated $z[\ ]$ data.

4) If publishing your work is a consideration, you should consult papers that were accepted for publication and see what sort of statistics they used. Applying statistics is a subjective matter. Different journals may have different standards.
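The bootstrap idea in (1) can be sketched as follows (toy Poisson data again stands in for $y[\ ][\ ]$; every number and name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 4, 1000
y = rng.poisson(lam=[50.0, 30.0, 20.0, 10.0], size=(N, M)).T  # toy y[i][j]

q = y / y.sum(axis=0)   # per-experiment proportions q[i][j]
E_P = q.mean(axis=1)    # estimated E(P_i)

# Treat each experiment j in turn as if it were the test data z[]
s_values = np.array([np.sum((q[:, j] - E_P) ** 2 / E_P) for j in range(N)])

# Empirical distribution of s; P(s >= 4.8) estimated as a tail fraction
tail_prob = np.mean(s_values >= 4.8)
```

Histogramming `s_values` (e.g. with `np.histogram`) gives the estimated distribution of $s$ under the null hypothesis.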
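A minimal version of the probability model in (2) might look like this, assuming (purely for illustration) Poisson-distributed total counts and a multinomial split among the PMTs; in practice both ingredients would be fitted to the $y[\ ][\ ]$ data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Fitted ingredients -- made up here; in practice fit them to the y[][] data
mean_total = 110.0                        # e.g. a Poisson fit to c_j = sum_i y[i][j]
E_P = np.array([0.45, 0.27, 0.18, 0.10])  # fitted per-PMT proportions

def simulate_s(n_reps=10000):
    """Simulate z[] data from the model and return the resulting s values."""
    s_values = []
    for _ in range(n_reps):
        total = rng.poisson(mean_total)
        if total == 0:                    # no events: proportions undefined, skip
            continue
        z = rng.multinomial(total, E_P)   # distribute the events among the PMTs
        O = z / total
        s_values.append(np.sum((O - E_P) ** 2 / E_P))
    return np.array(s_values)

s_values = simulate_s()   # histogram these to estimate the distribution of s
```

The Poisson/multinomial choice is only a placeholder; any model that reproduces the observed totals and per-PMT split would do.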

FatPhysicsBoy
Ultimately all the power is in my hands at the moment, since everything we are discussing is in fact being done in simulation. We have the full spherical detector geometry and can run events as and when and where we like through Monte Carlo. I think this method has broken down for the reasons you stated. Essentially, it works a lot better when there is no need to take a proportion of counts, and I have the freedom to do this for the purposes of this study, so I have removed that step. I am now taking $y_{a}[\ i][j]$, $y_{b}[\ i][j]$, $y_{c}[\ i][j]$, ..., where a, b, c, ... refer to different positions. I then take one of these, i.e. $y_{c}[\ i][j]$, and compare the $y_{c}[\ i]$ for each $j$ with the expected value at PMTi associated with $y_{a}[\ i][j]$, $y_{b}[\ i][j]$, $y_{c}[\ i][j]$, ... respectively. This is far more successful than when there was a $z[\ ]$, since $z[\ ]$ had in common with the y sets only the position, not the number of photoelectrons deposited.

Thank you for your help Stephen, I am happy going forward with my study following our discussion. Apologies for the confusing nature of my explanations, there is a lot going on with this experiment and so I am usually a little confused!