# I Estimating a systematic error

#### kelly0303

Hello! I am a bit confused about estimating the systematic error (I think it is systematic) from an experiment. Here is a (simplified) description of it. Assume that 2 groups measure the length of a cube with 2 different rulers, which, due to some effects, give slightly different results (for example, they are made of different materials and have different lengths due to thermal expansion). Assume that they associate the same error $dx$ with each of their measurements (based on the graduations of the ruler) and that each group makes the same number of measurements. After a large number of measurements the data is presented in a histogram and 2 peaks appear, with similar standard deviations (although this is not that important) but 2 clearly separated means $\mu_1$ and $\mu_2$. Obviously one of the measurements is not right (or maybe both), so I think this difference can be considered a systematic error. Now, if I had just one peak, the average length would be the mean of the Gaussian peak and the error would be $\sigma/\sqrt{N}$, where N is the number of measurements and $\sigma$ their standard deviation. However, now I need to take into account the fact that I don't know which of the 2 peaks is right, so I need to attach an error associated with that, too. How should I do it properly? Thank you!
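For the single-peak case described above, the $\sigma/\sqrt{N}$ prescription can be sketched in a few lines of Python (the measurement values below are made up for illustration):

```python
import math

def mean_and_standard_error(measurements):
    """Mean of repeated measurements and the standard error of the
    mean, sigma / sqrt(N), using the sample standard deviation."""
    n = len(measurements)
    mean = sum(measurements) / n
    variance = sum((x - mean) ** 2 for x in measurements) / (n - 1)
    return mean, math.sqrt(variance / n)

# Hypothetical single-peak data (cube lengths in cm):
mean, err = mean_and_standard_error([11.0, 11.2, 10.9, 11.1, 10.8])
print(f"{mean:.2f} +/- {err:.3f}")  # 11.00 +/- 0.071
```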


#### hutchphd

To say anything definitive you need to further characterize that population of rulers. Absent that, the ruler-to-ruler differences will simply swamp the other errors, making the N measurements more or less irrelevant. (In the absence of further information, all 2N measurements are equivalent and the bifurcation will be the predominant error source.) In practice this means either carefully recalibrating the rulers and perhaps discarding the bad data (for cause), or getting good rulers and repeating the experiment.

#### kelly0303

To say anything definitive you need to further characterize that population of rulers. Absent that, the ruler-to-ruler differences will simply swamp the other errors, making the N measurements more or less irrelevant. (In the absence of further information, all 2N measurements are equivalent and the bifurcation will be the predominant error source.) In practice this means either carefully recalibrating the rulers and perhaps discarding the bad data (for cause), or getting good rulers and repeating the experiment.
My question was more general, assuming that repeating the experiment is very expensive and that figuring out which of the measuring devices (which in a real experiment can be very complex) is doing something wrong is difficult. The ruler example was just to make my question clearer, without going into physics details. So, assuming that is the data, that I have no further way to clean it, and that I also know from theoretical calculations that I must have just one peak (i.e. the length can't have 2 values), how would I present my result in a paper? Of course in the paper I will explain the situation and why I get these results, but I must also give a best estimate for the mean value of the length and the error associated with it. What is the best way to do so? Thank you!

#### hutchphd

The issues here are precision (repeatability) and accuracy (~calibration). Your precision exceeds your accuracy.
Can you characterize the measurement apparatus independently, ex post facto, somehow? Are the problems likely to be durable and therefore amenable to such analysis? Is it really only two "rulers"? I don't know any magic that passes the "sniff" test without compelling information.
How do you know they were not both doing something wrong? That is my first question for you to answer.

#### kelly0303

The issues here are precision (repeatability) and accuracy (~calibration). Your precision exceeds your accuracy.
Can you characterize the measurement apparatus independently, ex post facto, somehow? Are the problems likely to be durable and therefore amenable to such analysis? Is it really only two "rulers"? I don't know any magic that passes the "sniff" test without compelling information.
How do you know they were not both doing something wrong? That is my first question for you to answer.
I am not sure they are not both wrong; this is what I said in the first post, too. They might both be wrong, but one still needs to give a best estimate of the length, given these measurements. If you publish a paper, you won't say the length of the cube is, say, $11 \pm 1$ cm and $14 \pm 1$ cm. The cube has just one length, so you have to combine them into a single value, which would obviously have a large error associated with it. My question is: how do you calculate the mean value and the error in this case?

Just to give a real-life example (the reason I avoided this is that it might be too complicated and have details I am not aware of): the lifetime of the neutron has 2 significantly different measured values: https://www.quantamagazine.org/neutron-lifetime-puzzle-deepens-but-no-dark-matter-seen-20180213/. One is $888 \pm 2.1$ seconds, the other is $879.3 \pm 0.75$ seconds. No one knows why they are different, so at least one of them is wrong (unless there is some unknown physics at play). Yet if you go on Wikipedia you will see the mean lifetime of the neutron given as $881.5(15)$ seconds. So the two measurements were combined into a single value, even though they are not consistent with one another.

Again, I am not sure about all the details of these experiments, so my example might not be very useful, but what I am trying to say is that getting 2 very different values from experiments measuring the same thing is not at all uncommon (same with the proton radius, though that might be almost solved). Also, you have no idea why they are different, so if you want to give just one value, you have to combine the 2 measurements. My question is: what is the best way to do that?

#### Dale

Mentor
Hello! I am a bit confused about estimating the systematic error (I think it is systematic) from an experiment. Here is a (simplified) description of it. Assume that 2 groups measure the length of a cube with 2 different rulers, which, due to some effects, give slightly different results (for example, they are made of different materials and have different lengths due to thermal expansion). Assume that they associate the same error $dx$ with each of their measurements (based on the graduations of the ruler) and that each group makes the same number of measurements. After a large number of measurements the data is presented in a histogram and 2 peaks appear, with similar standard deviations (although this is not that important) but 2 clearly separated means $\mu_1$ and $\mu_2$. Obviously one of the measurements is not right (or maybe both), so I think this difference can be considered a systematic error. Now, if I had just one peak, the average length would be the mean of the Gaussian peak and the error would be $\sigma/\sqrt{N}$, where N is the number of measurements and $\sigma$ their standard deviation. However, now I need to take into account the fact that I don't know which of the 2 peaks is right, so I need to attach an error associated with that, too. How should I do it properly? Thank you!
The best reference for handling uncertainty is in this guide from NIST

See sections 2.2 and 2.3 first. Uncertainty is no longer classified as random or systematic. It is now classified as uncertainty that is evaluated by statistical methods or uncertainty that is evaluated by other means. In this case, although your uncertainty would have formerly been classified as systematic, you are evaluating it by statistical methods.

Section 3 gives some scanty advice for dealing with this type of uncertainty. It suggests using ANOVA, which seems reasonable to me.

#### hutchphd

If the uncertainties in each reported number are simply considered RMS sampling errors (call them $\sigma_1$ and $\sigma_2$), then it makes sense to weight each mean value by $1/\sigma^2$. (The easy way to think about this is that each $\sigma$ is proportional to $1/\sqrt{N}$, where N is the number of measurements in each test.) This gives the Wikipedia mean value, I believe, and probably makes sense if you insist upon combining values.
How they get the uncertainty I don't know (frankly, I never know for certain what that notation even means!). The number $\pm 1.5$ seems grotesquely wrong to me, as does $\pm 15$.
Were I required to report a value, I would choose the smaller value with its uncertainty and footnote the possible discrepancy. The longer value is more likely wrong, in my estimation, because failure to see decay events will produce a larger lifetime, and this is more likely than seeing erroneous events, I think. But there is no clean prescription.
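As a minimal sketch, inverse-variance weighting applied to the two neutron-lifetime values quoted earlier in the thread looks like this (the combined uncertainty uses the textbook formula $(\sum 1/\sigma_i^2)^{-1/2}$; the result need not match any particular published average):

```python
import math

def inverse_variance_mean(values, sigmas):
    """Weighted mean with weights 1/sigma^2, plus the standard
    combined uncertainty (sum of weights) ** -0.5."""
    weights = [1 / s ** 2 for s in sigmas]
    mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)
    return mean, 1 / math.sqrt(sum(weights))

# The two neutron-lifetime results quoted above (seconds):
mean, err = inverse_variance_mean([888.0, 879.3], [2.1, 0.75])
print(f"{mean:.1f} +/- {err:.2f}")  # 880.3 +/- 0.71
```

Note that this formula assumes the two results are statistically consistent, which is exactly what is in question here.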

#### gleem

The fact that you get two different estimates with two instruments alerts you to a problem with your measurements. Both must be suspect to start. Instruments are usually accompanied by an accuracy spec from the manufacturer based on some sort of calibration procedure. Ideally that spec should be verified by the user.

In your case you should obtain a certified (standard) length from a trusted agency and compare your rulers to it. Any discrepancy can then be compensated for. Such a standard will also have an uncertainty associated with it. This uncertainty should be considered systematic. I would add it to the statistical uncertainty. Ultimately you have to take into account all factors that contribute to the final value; for example, both rulers could be fine but give different values depending on the temperature of the object being measured.

My philosophy is that one probably knows the accuracy of one's measurements less well than one thinks.

#### hutchphd

In your case you should obtain a certified (standard) length from a trusted agency and compare your rulers to it.
I think you missed #5 from the OP. I was unaware of the discrepancy, and it makes the analysis less routine....

#### gleem

I think you missed #5 from the OP. I was unaware of the discrepancy, and it makes the analysis less routine....
No, I read it before. I still don't see your point.

The neutron discrepancy is between two labs. The OP problem is internal and, if it were in my lab, would have to be resolved. I do not see any "good" way to combine two measurements that are three SD apart, made with two different instruments, unless you combine all the measurements assuming the two instruments are equivalent, calculate a mean and SD from that, and sweep the issue under the rug. But I don't think that is honest.

Another approach is to let an unbiased person(s) look at the data and data collection process and evaluate it. Maybe they can see the problem.

I agree.

#### Dale

Mentor
Such a standard will also have an uncertainty associated with it. This uncertainty should be considered systematic.
Again, uncertainty is no longer classified this way, but you are correct that it is an uncertainty that is evaluated by other methods.

I would add it to the statistical uncertainty.
I agree. Specifically, add the variances, not the standard deviations.

#### kelly0303

No, I read it before. I still don't see your point.

The neutron discrepancy is between two labs. The OP problem is internal and, if it were in my lab, would have to be resolved. I do not see any "good" way to combine two measurements that are three SD apart, made with two different instruments, unless you combine all the measurements assuming the two instruments are equivalent, calculate a mean and SD from that, and sweep the issue under the rug. But I don't think that is honest.

Another approach is to let an unbiased person(s) look at the data and data collection process and evaluate it. Maybe they can see the problem.
I never said the measurements are in the same lab. When I said 2 groups I meant it in general; they may or may not be from the same lab (that is not really relevant to what I want to know). But to make it clear (and more along the lines of the neutron lifetime example), say that 2 papers publish the same measurement (made in 2 different labs) and give 2 different values that differ by several sigma. How would you combine the 2 numbers into one number with an associated error (assume you want to write a Wikipedia entry, for example)? Or even simpler, how did they calculate a single value on Wikipedia from the 2 different measurements of the neutron lifetime? Understanding how they did that might automatically answer my question.

#### kelly0303

I agree. Specifically, add the variances, not the standard deviations.

#### Dale

Mentor
If you have a statistically determined uncertainty $\sigma_1$ and an uncertainty $\sigma_2$ determined by other methods, then the combined uncertainty is $\sigma_c^2=\sigma_1^2+\sigma_2^2$ and not $\sigma_c=\sigma_1+\sigma_2$
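A one-line sketch of adding in quadrature (illustrative numbers only):

```python
import math

def combined_uncertainty(sigma1, sigma2):
    """Combine two independent uncertainties in quadrature:
    add the variances, then take the square root."""
    return math.sqrt(sigma1 ** 2 + sigma2 ** 2)

# Add variances, not standard deviations:
print(combined_uncertainty(3.0, 4.0))  # 5.0, not 3 + 4 = 7
```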

#### kelly0303

If you have a statistically determined uncertainty $\sigma_1$ and an uncertainty $\sigma_2$ determined by other methods, then the combined uncertainty is $\sigma_c^2=\sigma_1^2+\sigma_2^2$ and not $\sigma_c=\sigma_1+\sigma_2$
Oh, sure! My problem in this case is that I am not sure what to use for each of the two. To give some numerical examples: $\mu_1 = 50$, $\mu_2 = 60$, $\sigma_1 = 3$, $\sigma_2 = 4$ (the $\sigma_1$ and $\sigma_2$ here are not necessarily the ones in your formula). What mean value and error on the mean would you use based on these values?

#### Dale

Mentor
Since you are treating both of these statistically, the NIST approach recommends using ANOVA. That will give you a mean and a standard error on the mean. Since the standard deviations are different you would want to apply a correction for heteroskedasticity. Most software packages will have one, but you will need to check the documentation.

#### Staff Emeritus
Here is what people do. For simplicity, I am assuming both peaks have the same number of entries and the same width; if that's not the case you can weight.

The estimator is the average of the peaks' positions. The uncertainty on the mean is the uncertainty on a single peak added in quadrature with the distance between the peaks.

This slightly overestimates the error, but better that than the reverse.
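A minimal sketch of this prescription (the peak positions, width, and N below are hypothetical, echoing the numbers used elsewhere in the thread):

```python
import math

def combine_two_peaks(mu1, mu2, sigma_peak, n):
    """Combine two peaks with equal entries and equal width.

    Estimator: the average of the two peak positions.
    Uncertainty: the standard error of a single peak, added in
    quadrature with the distance between the peaks (deliberately
    conservative, so it slightly overestimates the error).
    """
    mean = 0.5 * (mu1 + mu2)
    stderr = sigma_peak / math.sqrt(n)  # single-peak standard error
    delta = abs(mu1 - mu2)              # distance between the peaks
    return mean, math.sqrt(stderr ** 2 + delta ** 2)

# Hypothetical: peaks at 50 and 60, width 3, 100 measurements per group
mean, err = combine_two_peaks(50.0, 60.0, 3.0, 100)
print(f"{mean:.1f} +/- {err:.1f}")  # 55.0 +/- 10.0
```

The peak separation dominates the error here, as intended: the bifurcation, not the sampling noise, is the main uncertainty.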

#### gleem

I never said the measurements are in the same lab. When I said 2 groups I meant it in general; they may or may not be from the same lab (that is not really relevant to what I want to know). But to make it clear (and more along the lines of the neutron lifetime example), say that 2 papers publish the same measurement (made in 2 different labs) and give 2 different values that differ by several sigma. How would you combine the 2 numbers into one number with an associated error (assume you want to write a Wikipedia entry, for example)? Or even simpler, how did they calculate a single value on Wikipedia from the 2 different measurements of the neutron lifetime? Understanding how they did that might automatically answer my question.
Your OP made no sense to me unless it was in the same lab and for the same experiment.

Your neutron lifetime example is poor since the two experimental results are from completely different experiments. Differing results may be disconcerting. As I stated above, when two measurements of the same quantity do not agree, it "alerts" you to possible issues with one or both experiments. In the case of the neutron lifetime measurements, something might be going on that is overlooked, and isn't that the current controversy? Look at the measurements for each experimental technique. The beam measurements have a larger uncertainty and the technique is probably difficult, which I can appreciate: they measured the decay of the neutrons by measuring how fast protons and electrons are being produced. The trapped-neutron measurement had a significantly smaller uncertainty but measured the number of undecayed neutrons, i.e. how fast they are disappearing. I do not know how valid it is to average them. However, it does appear to me that they may have taken the reciprocal-uncertainty weighted average of the two, thus weighting the more accurate experiment more heavily:

$$(x_1/\sigma_1 + x_2/\sigma_2)/(1/\sigma_1 + 1/\sigma_2)$$

So in your case, are you talking about experiments measuring the same thing the same way, or, as in the neutron example, using completely different techniques?
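For comparison, here is a small sketch of both weighting schemes discussed in this thread, applied to the neutron-lifetime numbers quoted above (this only illustrates the arithmetic; it is not a claim about how the published average was actually computed):

```python
values = [888.0, 879.3]   # neutron-lifetime results (seconds)
sigmas = [2.1, 0.75]      # their quoted uncertainties

def weighted_mean(values, weights):
    """Generic weighted average."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

w_recip = [1 / s for s in sigmas]        # 1/sigma weights, as in the formula above
w_invvar = [1 / s ** 2 for s in sigmas]  # 1/sigma^2 (inverse-variance) weights

print(f"{weighted_mean(values, w_recip):.1f}")   # 881.6
print(f"{weighted_mean(values, w_invvar):.1f}")  # 880.3
```

The two schemes weight the more precise experiment differently, which is why they give noticeably different combined values here.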

#### WWGD

Gold Member
I never said the measurements are in the same lab. When I said 2 groups I meant in general. It can be or not from the same lab (that is not really relevant to what I want to know). But to make it clear (and more on the line of the neutron lifetime example), say that 2 papers publish the same measurement (made in 2 different labs) and give 2 different values that differ by several sigma. How would you combine (assume you want to make a post on Wikipedia for example?) the 2 numbers into one number with an associated error? Or even simpler, how did they calculate a single value on Wikipedia from the 2 different measurements of the neutron lifetime? Understanding how they did it, might help answer my question automatically.
Are you measuring the (literally) same object, e.g., measuring it in one lab and shipping it to the second, or are you measuring the same line of product, e.g., acme product #213-x1, in both labs separately?

#### hutchphd

However, it does appear to me that they may have taken the reciprocal-uncertainty weighted average of the two, thus weighting the more accurate experiment more heavily:

$$(x_1/\sigma_1 + x_2/\sigma_2)/(1/\sigma_1 + 1/\sigma_2)$$
As mentioned previously, they weighted them inversely according to the variance (which is the square of the SD). This is the "correct" way to do such a weighting, as can be seen by arbitrarily dividing a homogeneous sample into subpopulations of different size.
Whether it is appropriate here is dubious, but it is correct in the simple case.

#### kelly0303

Are you measuring the (literally) same object, e.g., measuring it in one lab and shipping it to the second, or are you measuring the same line of product, e.g., acme product #213-x1, in both labs separately?
I don't have an actual experiment. I am just curious, for my own knowledge. Again, using the neutron lifetime, you can assume they are literally the same object (i.e. a neutron), but measured in two different labs.

#### kelly0303

Your OP made no sense to me unless it was in the same lab and for the same experiment.

Your neutron lifetime example is poor since the two experimental results are from completely different experiments. Differing results may be disconcerting. As I stated above, when two measurements of the same quantity do not agree, it "alerts" you to possible issues with one or both experiments. In the case of the neutron lifetime measurements, something might be going on that is overlooked, and isn't that the current controversy? Look at the measurements for each experimental technique. The beam measurements have a larger uncertainty and the technique is probably difficult, which I can appreciate: they measured the decay of the neutrons by measuring how fast protons and electrons are being produced. The trapped-neutron measurement had a significantly smaller uncertainty but measured the number of undecayed neutrons, i.e. how fast they are disappearing. I do not know how valid it is to average them. However, it does appear to me that they may have taken the reciprocal-uncertainty weighted average of the two, thus weighting the more accurate experiment more heavily:

$$(x_1/\sigma_1 + x_2/\sigma_2)/(1/\sigma_1 + 1/\sigma_2)$$

So in your case, are you talking about experiments measuring the same thing the same way, or, as in the neutron example, using completely different techniques?
I don't have an actual experiment. I am just curious based on things I read online about experiments giving different results; I gave the ruler example just for simplicity. So in the formula for the mean you gave, I assume you meant to use the variance, not the standard deviation, right? However, my main issue is how to calculate the error associated with this mean (again, assume the neutron lifetime case, in order to have a clear example).

#### kelly0303

Since you are treating both of these statistically, the NIST approach recommends using ANOVA. That will give you a mean and a standard error on the mean. Since the standard deviations are different you would want to apply a correction for heteroskedasticity. Most software packages will have one, but you will need to check the documentation.
Thank you for this! I am quite new to statistics in general, but isn't ANOVA a way to check whether two means are statistically different? Does it actually give you the overall mean and error of these measurements?

#### kelly0303

Here is what people do. For simplicity, I am assuming both peaks have the same number of entries and the same width; if that's not the case you can weight.

The estimator is the average of the peaks' positions. The uncertainty on the mean is the uncertainty on a single peak added in quadrature with the distance between the peaks.

This slightly overestimates the error, but better that than the reverse.
Thanks a lot for this! I am pretty sure this is what I was looking for! So the error would be $\sqrt{\sigma_1^2+(\mu_1-\mu_2)^2}$. Is this right? One more question: assuming there are more than 2 peaks (again, same number of events and same width), how would you account for the differences between the peaks in this case?