How Should Systematic Errors Be Estimated When Experiment Results Differ?

kelly0303 · Oct 17, 2019

Hello! I am a bit confused about estimating the systematic error (I think it is systematic) from an experiment. Here is a (simplified) description of it. Assume that 2 groups measure the length of a cube with 2 different rulers which, due to some effects, give slightly different results (for example, they are made of different materials and have different lengths due to thermal expansion). Assume that they associate the same error ##dx## with each of their measurements (based on the grading of the ruler) and that each group makes the same number of measurements. After a large number of measurements the data is presented in a histogram and 2 peaks appear, with a similar standard deviation (although this is not that important) but 2 clearly separated means ##\mu_1## and ##\mu_2##. Obviously one of the measurements is not right (or maybe both), so I think that this difference can be considered a systematic error. Now, if I had just one peak, the average length would be the mean of the Gaussian peak and the error would be ##\sigma/\sqrt{N}##, where ##N## is the number of measurements and ##\sigma## their standard deviation. However, now I need to take into account the fact that I don't know which one of the 2 peaks is right, so I need to attach an error associated with that, too. How should I do it properly? Thank you!

hutchphd · Oct 17, 2019

To say anything definitive you need to further characterize that population of rulers. Absent that, the ruler differences will simply swamp the other errors, making the N measurements more or less irrelevant. (In the absence of further information all the 2N measurements are equivalent and the bifurcation will be the predominant error source.) In practice this should mean either carefully recalibrating the rulers and perhaps discarding the bad data (for cause), or getting good rulers and repeating the experiment.

kelly0303 · Oct 17, 2019

hutchphd said:

To say anything definitive you need to further characterize that population of rulers. Absent that, the ruler differences will simply swamp the other errors, making the N measurements more or less irrelevant. (In the absence of further information all the 2N measurements are equivalent and the bifurcation will be the predominant error source.) In practice this should mean either carefully recalibrating the rulers and perhaps discarding the bad data (for cause), or getting good rulers and repeating the experiment.

My question was more general, assuming that repeating the experiment is very expensive and that it is difficult to figure out which one of the measuring devices (which in a real experiment can be very complex) is doing something wrong. The ruler example was just to make my question clearer, without going into physics details. So, assuming that this is the data, that I have no further way to clean it, and that I also know from theoretical calculations that I must have just one peak (in this case, the length can't have 2 values), how would I present my result in a paper? Of course in a paper I will explain the situation and why I get these results, but I also must give a best estimate for the mean value of the length and the error associated with it. What is the best way to do so? Thank you!

hutchphd · Oct 17, 2019

The issues here are precision (repeatability) and accuracy (~calibration). Your precision exceeds your accuracy.
Can you characterize the measurement apparatus independently ex post facto somehow? Are the problems likely to be durable and therefore amenable to such analysis? Is it really only two "rulers"? I don't know any magic that passes the "sniff" test without compelling information.
How do you know they were not both doing something wrong? That is my first question for you to answer...

kelly0303 · Oct 17, 2019

hutchphd said:

The issues here are precision (repeatability) and accuracy (~calibration). Your precision exceeds your accuracy.
Can you characterize the measurement apparatus independently ex post facto somehow? Are the problems likely to be durable and therefore amenable to such analysis? Is it really only two "rulers"? I don't know any magic that passes the "sniff" test without compelling information.
How do you know they were not both doing something wrong? That is my first question for you to answer...

I am not sure they are not both wrong. This is what I said in the first post, too. They might both still be wrong, but one still needs to give a best estimate of the length value given these measurements, i.e. if you publish a paper, you won't say the length of the cube is, say, ##11 \pm 1## cm and ##14 \pm 1## cm. The cube has just one length, so you have to combine them into just one value, which would obviously have a big error associated with it. My question is, how do you calculate the mean value and the error in this case? Just to give an actual real-life example (the reason I avoided this is because it might be too complicated and have details that I am not aware of, but here it is): the lifetime of the neutron has 2 significantly different measured values: https://www.quantamagazine.org/neutron-lifetime-puzzle-deepens-but-no-dark-matter-seen-20180213/. One is ##888 \pm 2.1## seconds, the other one is ##879.3 \pm 0.75## seconds. No one knows why they are different, so at least one of them is wrong (unless there is some unknown physics at play). Yet if you go on Wikipedia you will see that the mean lifetime of the neutron is given as ##881.5(15)## seconds. So the two measurements were combined into a single value, even though they are not consistent with one another. Again, I am not sure about all the details of these experiments, so my example might not be very useful, but what I am trying to say is that having an experiment measure something and get 2 very different values is not at all uncommon (same thing with the proton radius, but that might have been almost solved). Also, you have no idea why they are different, so if you want to give just one value, you have to combine the 2 measurements. My question is, what is the best way to do that?

Dale · Oct 17, 2019

kelly0303 said:

Hello! I am a bit confused about estimating the systematic error (I think it is systematic) from an experiment. Here is a (simplified) description of it. Assume that 2 groups measure the length of a cube with 2 different rulers which, due to some effects, give slightly different results (for example, they are made of different materials and have different lengths due to thermal expansion). Assume that they associate the same error ##dx## with each of their measurements (based on the grading of the ruler) and that each group makes the same number of measurements. After a large number of measurements the data is presented in a histogram and 2 peaks appear, with a similar standard deviation (although this is not that important) but 2 clearly separated means ##\mu_1## and ##\mu_2##. Obviously one of the measurements is not right (or maybe both), so I think that this difference can be considered a systematic error. Now, if I had just one peak, the average length would be the mean of the Gaussian peak and the error would be ##\sigma/\sqrt{N}##, where ##N## is the number of measurements and ##\sigma## their standard deviation. However, now I need to take into account the fact that I don't know which one of the 2 peaks is right, so I need to attach an error associated with that, too. How should I do it properly? Thank you!

The best reference for handling uncertainty is in this guide from NIST

https://www.nist.gov/sites/default/files/documents/2017/05/09/tn1297s.pdf

See sections 2.2 and 2.3 first. Uncertainty is no longer classified as random or systematic. It is now classified as uncertainty that is evaluated by statistical methods or uncertainty that is evaluated by other means. In this case, although your uncertainty would have formerly been classified as systematic, you are evaluating it by statistical methods.

Section 3 gives some scanty advice for dealing with this type of uncertainty. It suggests using ANOVA, which seems reasonable to me.

hutchphd · Oct 18, 2019

If the uncertainties in each reported number are simply considered RMS sampling errors (call them σ₁ and σ₂), then it makes sense to weight each mean value by 1/σ². (The easy way to think about this is that each σ is proportional to 1/√N, where N is the number of measurements in each test.) This gives the Wikipedia mean value, I believe, and probably makes sense if you insist upon combining values.
How they get the uncertainty I don't know (frankly I never know for certain what that notation even means!). The number ±1.5 seems grotesquely wrong to me, as does ±15.
Were I required to report a value I would choose the smaller value with its uncertainty and footnote the possible discrepancy. The longer value is more likely wrong in my estimation because failure to see decay events will produce a larger lifetime and this is more likely than seeing erroneous events I think. But there is no clean prescription.
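
A minimal sketch of the inverse-variance weighting described above, applied to the two neutron-lifetime values quoted earlier in the thread. The function name `inverse_variance_mean` is made up for illustration, and the formula for the combined uncertainty assumes independent, unbiased inputs with trustworthy error bars, which is exactly what is in doubt here. For these two inputs the result lands near 880.3 s, so the 881.5(15) figure presumably rests on a different set of inputs.

```python
import math

def inverse_variance_mean(values, sigmas):
    """Combine measurements, weighting each one by 1/sigma^2."""
    weights = [1.0 / s**2 for s in sigmas]
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total
    # Standard error of the weighted mean. Only meaningful if the inputs
    # are mutually consistent, which is an assumption here.
    sigma = 1.0 / math.sqrt(total)
    return mean, sigma

# Values quoted in the thread, in seconds.
mean, sigma = inverse_variance_mean([888.0, 879.3], [2.1, 0.75])
print(f"{mean:.2f} +/- {sigma:.2f} s")
```

Note that the combined uncertainty comes out smaller than either input uncertainty and takes no account of the disagreement between the two values.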

gleem · Oct 18, 2019

The fact that you get two different estimates with two instruments alerts you to a problem with your measurements. Both must be suspect to start. Instruments are usually accompanied by an accuracy spec from the manufacturer based on some sort of calibration procedure. Ideally that spec should be verified by the user.

In your case you should obtain a certified (standard) length from a trusted agency and compare your rulers to it. Any discrepancy can then be compensated for. Such a standard will also have an uncertainty associated with it. This uncertainty should be considered systematic. I would add it to the statistical uncertainty. Ultimately you have to take into account all factors that contribute to the final value; for example, both rulers could be fine but give different values depending on the temperature of the object being measured.

My philosophy is that one probably knows the accuracy of his measurements less than he thinks.

hutchphd · Oct 18, 2019

gleem said:

In your case you should obtain a certified (standard) length from a trusted agency and compare your rulers to it.

I think you missed #5 from OP. I was unaware of the discrepancy and it makes the analysis less routine...

gleem · Oct 18, 2019

hutchphd said:

I think you missed #5 from OP. I was unaware of the discrepancy and it makes the analysis less routine...

No, I read it before. I still don't see your point.

The neutron discrepancy is between two labs. The OP problem is internal and, if it were in my lab, would have to be resolved. I do not see any "good" way to combine two measurements that are three SD apart and made with two different instruments, unless you combine all the measurements assuming the two instruments are equivalent, calculate a mean and SD from that, and sweep the issue under the rug. But I don't think that is honest.

Another approach is to let an unbiased person(s) look at the data and data collection process and evaluate it. Maybe they can see the problem.

hutchphd · Oct 18, 2019

I agree.

Dale · Oct 18, 2019

gleem said:

Such a standard will also have an uncertainty associated with it. This uncertainty should be considered systematic.

Again, uncertainty is no longer classified this way, but you are correct that it is an uncertainty that is evaluated by other methods.

gleem said:

I would add it to the statistical uncertainty.

I agree. Specifically, add the variances, not the standard deviations.

kelly0303 · Oct 18, 2019

gleem said:

No, I read it before. I still don't see your point.

The neutron discrepancy is between two labs. The OP problem is internal and, if it were in my lab, would have to be resolved. I do not see any "good" way to combine two measurements that are three SD apart and made with two different instruments, unless you combine all the measurements assuming the two instruments are equivalent, calculate a mean and SD from that, and sweep the issue under the rug. But I don't think that is honest.

Another approach is to let an unbiased person(s) look at the data and data collection process and evaluate it. Maybe they can see the problem.

I never said the measurements are in the same lab. When I said 2 groups I meant in general. It can be or not from the same lab (that is not really relevant to what I want to know). But to make it clear (and more on the line of the neutron lifetime example), say that 2 papers publish the same measurement (made in 2 different labs) and give 2 different values that differ by several sigma. How would you combine (assume you want to make a post on Wikipedia for example?) the 2 numbers into one number with an associated error? Or even simpler, how did they calculate a single value on Wikipedia from the 2 different measurements of the neutron lifetime? Understanding how they did it, might help answer my question automatically.

kelly0303 · Oct 18, 2019

Dale said:

I agree. Specifically, add the variances, not the standard deviations.

Could you elaborate a bit about this? I am not sure I understand what you mean

Dale · Oct 18, 2019

kelly0303 said:

Could you elaborate a bit about this? I am not sure I understand what you mean

If you have a statistically determined uncertainty ##\sigma_1## and an uncertainty ##\sigma_2## determined by other methods, then the combined uncertainty is ##\sigma_c^2=\sigma_1^2+\sigma_2^2## and not ##\sigma_c=\sigma_1+\sigma_2##
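
As a small illustration of the difference between adding variances and adding standard deviations, here is a sketch in Python. The numbers 3.0 and 4.0 are hypothetical, chosen only because they give a round answer, and `combine_in_quadrature` is a made-up helper name.

```python
import math

def combine_in_quadrature(*sigmas):
    """Combined standard uncertainty: square root of the summed variances."""
    return math.sqrt(sum(s**2 for s in sigmas))

sigma_stat = 3.0    # hypothetical statistically evaluated uncertainty
sigma_other = 4.0   # hypothetical uncertainty evaluated by other means

combined = combine_in_quadrature(sigma_stat, sigma_other)
linear_sum = sigma_stat + sigma_other  # the sum the post warns against

print(combined)    # 5.0
print(linear_sum)  # 7.0
```

Adding in quadrature treats the two contributions as independent, so the result is always smaller than the linear sum.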

kelly0303 · Oct 19, 2019

Dale said:

If you have a statistically determined uncertainty ##\sigma_1## and an uncertainty ##\sigma_2## determined by other methods, then the combined uncertainty is ##\sigma_c^2=\sigma_1^2+\sigma_2^2## and not ##\sigma_c=\sigma_1+\sigma_2##

Oh, sure! My problem in this case is that I am not sure what to use for each of the two. To give some numerical examples ##\mu_1 = 50##, ##\mu_2 = 60##, ##\sigma_1 = 3##, ##\sigma_2 = 4## (the ##\sigma_1## and ##\sigma_2## are not necessarily the ones in your formula), what mean value and error on the mean would you use based on these values?

Dale · Oct 19, 2019

Since you are treating both of these statistically, the NIST approach recommends using ANOVA. That will give you a mean and a standard error on the mean. Since the standard deviations are different you would want to apply a correction for heteroskedasticity. Most software packages will have one, but you will need to check the documentation.
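
To make the ANOVA suggestion a little more concrete, here is a rough sketch of the classic equal-variance one-way ANOVA decomposition in plain Python. It is not the heteroskedasticity-corrected variant mentioned above, the data are invented, and `one_way_anova` is a hypothetical helper rather than any package's API. It only shows how the scatter splits into a between-group part (the ruler-to-ruler difference) and a within-group part (the scatter of each ruler).

```python
def one_way_anova(groups):
    """Return (grand_mean, F) for a list of groups of measurements."""
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: scatter of the points around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    return grand_mean, f_stat

# Invented readings from two rulers (arbitrary units).
ruler_a = [49.0, 50.0, 51.0]
ruler_b = [59.0, 60.0, 61.0]

grand_mean, f_stat = one_way_anova([ruler_a, ruler_b])
print(grand_mean, f_stat)
```

A large F value says the between-group spread dominates the within-group spread, which is the situation described in the original question.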

Vanadium 50 · Oct 19, 2019

Here is what people do. For simplicity, I am assuming both peaks have the same number of entries and the same width; if that's not the case you can weight.

The estimator is the average of the peaks' positions. The uncertainty on the mean is the uncertainty on a single peak added in quadrature with the distance between the peaks.

This slightly overestimates the error, but better that than the reverse.
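
The recipe above can be sketched as follows, using the example means 50 and 60 from the thread and assuming a common single-peak uncertainty of 3 (the post assumes equal widths). The name `two_peak_estimate` and the parameter `sigma_peak` are made up for illustration, and `sigma_peak` stands for whatever uncertainty one assigns to a single peak's position.

```python
import math

def two_peak_estimate(mu1, mu2, sigma_peak):
    """Average of the peak positions, with the peak separation
    added in quadrature to the single-peak uncertainty."""
    center = 0.5 * (mu1 + mu2)
    error = math.sqrt(sigma_peak**2 + (mu1 - mu2) ** 2)
    return center, error

center, error = two_peak_estimate(50.0, 60.0, 3.0)
print(f"{center:.1f} +/- {error:.1f}")
```

When the peaks are well separated, the quoted error is dominated by the separation itself rather than by the statistical width.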

gleem · Oct 19, 2019

kelly0303 said:

I never said the measurements are in the same lab. When I said 2 groups I meant in general. It can be or not from the same lab (that is not really relevant to what I want to know). But to make it clear (and more on the line of the neutron lifetime example), say that 2 papers publish the same measurement (made in 2 different labs) and give 2 different values that differ by several sigma. How would you combine (assume you want to make a post on Wikipedia for example?) the 2 numbers into one number with an associated error? Or even simpler, how did they calculate a single value on Wikipedia from the 2 different measurements of the neutron lifetime? Understanding how they did it, might help answer my question automatically.

Your OP made no sense to me unless it was in the same lab and for the same experiment.

Your neutron lifetime example is poor since the two experimental results are from completely different experiments. Differing results may be disconcerting. As I stated above, when two measurements of the same quantity do not agree it "alerts" you to a possible issue with one or both experiments. In the case of the neutron lifetime measurements something might be going on that is overlooked, and isn't that the current controversy? Look at the measurements for each experimental technique. The beam measurements have a larger uncertainty and the technique is probably difficult, which I can appreciate. They measured the decay of the neutrons by measuring how fast protons and electrons are being produced. The trapped-neutron measurement had a significantly smaller uncertainty but measured the number of undecayed neutrons, i.e. how fast they are disappearing. I do not know how valid it is to average them. However, it does appear to me that they may have taken the reciprocal-uncertainty-weighted average of the two, thus weighting the more accurate experiment more heavily.

##\dfrac{X_1/\sigma_1 + X_2/\sigma_2}{1/\sigma_1 + 1/\sigma_2}##

So in your case are you talking about experiments measuring the same thing the same way or as in the neutron example using completely different techniques?

WWGD · Oct 19, 2019

kelly0303 said:

I never said the measurements are in the same lab. When I said 2 groups I meant in general. It can be or not from the same lab (that is not really relevant to what I want to know). But to make it clear (and more on the line of the neutron lifetime example), say that 2 papers publish the same measurement (made in 2 different labs) and give 2 different values that differ by several sigma. How would you combine (assume you want to make a post on Wikipedia for example?) the 2 numbers into one number with an associated error? Or even simpler, how did they calculate a single value on Wikipedia from the 2 different measurements of the neutron lifetime? Understanding how they did it, might help answer my question automatically.

Are you measuring (literally) the same object, e.g., measuring it in one lab and shipping it to the second, or are you measuring the same line of product, e.g., acme product #213-x1, in both labs separately?

hutchphd · Oct 19, 2019

gleem said:

However, it does appear to me that they may have taken the reciprocal-uncertainty-weighted average of the two, thus weighting the more accurate experiment more heavily.

##\dfrac{X_1/\sigma_1 + X_2/\sigma_2}{1/\sigma_1 + 1/\sigma_2}##

As mentioned previously, they weighted them inversely according to the variance (which is the square of the SD). This is the "correct" way to do such a weighting, as can be seen by arbitrarily dividing a homogeneous sample into subpopulations of different sizes.
Whether it is appropriate here is dubious, but it is correct in the simple case.
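
The subpopulation argument can be checked numerically. The sketch below splits a made-up homogeneous sample into two parts of unequal size, assumes a known per-measurement standard deviation `sigma`, and compares the inverse-variance weighted mean of the two sub-means with the mean of the whole sample. All data and names are hypothetical.

```python
import math

def weighted_mean(means, sigmas):
    """Inverse-variance weighted mean of several estimates."""
    weights = [1.0 / s**2 for s in sigmas]
    return sum(w * m for w, m in zip(weights, means)) / sum(weights)

sigma = 2.0  # assumed known standard deviation of a single measurement
sample = [9.8, 10.4, 10.1, 9.5, 10.9, 10.2, 9.9, 10.6]  # invented data

# Arbitrary unequal split: 2 points and 6 points.
part_a, part_b = sample[:2], sample[2:]
parts = (part_a, part_b)

means = [sum(p) / len(p) for p in parts]
# Each sub-mean has standard error sigma / sqrt(n), so 1/sigma^2 weights scale with n.
sigmas = [sigma / math.sqrt(len(p)) for p in parts]

pooled = sum(sample) / len(sample)       # mean of the undivided sample
combined = weighted_mean(means, sigmas)  # inverse-variance combination
naive = sum(means) / len(means)          # unweighted average of the sub-means

print(pooled, combined, naive)
```

The inverse-variance combination reproduces the pooled mean, while the unweighted average of the sub-means does not, because it gives the 2-point subsample as much say as the 6-point one.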

kelly0303 · Oct 19, 2019

WWGD said:

Are you measuring (literally) the same object, e.g., measuring it in one lab and shipping it to the second, or are you measuring the same line of product, e.g., acme product #213-x1, in both labs separately?

I don't have an actual experiment. I am just curious, just for my personal knowledge. Again, using the neutron lifetime, you can assume they are the literally same object (i.e. a neutron), but measured in two different labs.

kelly0303 · Oct 19, 2019

gleem said:

Your OP made no sense to me unless it was in the same lab and for the same experiment.

Your neutron lifetime example is poor since the two experimental results are from completely different experiments. Differing results may be disconcerting. As I stated above, when two measurements of the same quantity do not agree it "alerts" you to a possible issue with one or both experiments. In the case of the neutron lifetime measurements something might be going on that is overlooked, and isn't that the current controversy? Look at the measurements for each experimental technique. The beam measurements have a larger uncertainty and the technique is probably difficult, which I can appreciate. They measured the decay of the neutrons by measuring how fast protons and electrons are being produced. The trapped-neutron measurement had a significantly smaller uncertainty but measured the number of undecayed neutrons, i.e. how fast they are disappearing. I do not know how valid it is to average them. However, it does appear to me that they may have taken the reciprocal-uncertainty-weighted average of the two, thus weighting the more accurate experiment more heavily.

##\dfrac{X_1/\sigma_1 + X_2/\sigma_2}{1/\sigma_1 + 1/\sigma_2}##

So in your case are you talking about experiments measuring the same thing the same way or as in the neutron example using completely different techniques?

I don't have an actual experiment. I am just curious based on things I read online about experiments giving different results. I gave the ruler example just for simplicity. So in the formula for the mean you gave, I assume you meant to use the variance, not the standard deviation, right? However, my main issue is with how to calculate the error associated with this mean (again, assume in the case of neutron lifetime, in order to have a clear example).

kelly0303 · Oct 19, 2019

Dale said:

Since you are treating both of these statistically, the NIST approach recommends using ANOVA. That will give you a mean and a standard error on the mean. Since the standard deviations are different you would want to apply a correction for heteroskedasticity. Most software packages will have one, but you will need to check the documentation.

Thank you for this! I am quite new to statistics in general, but isn't ANOVA a way to check if the two means are statistically different? Does it actually give you the overall mean and error of these measurements?

kelly0303 · Oct 19, 2019

Vanadium 50 said:

Here is what people do. For simplicity, I am assuming both peaks have the same number of entries and the same width; if that's not the case you can weight.

The estimator is the average of the peaks' positions. The uncertainty on the mean is the uncertainty on a single peak added in quadrature with the distance between the peaks.

This slightly overestimates the error, but better that than the reverse.

Thanks a lot for this! I am pretty sure this is what I was looking for! So the error would be ##\sqrt{\sigma_1^2+(\mu_1-\mu_2)^2}##. Is this right? One more question, assuming there are more than 2 peaks (again same number of events and same width), how would you consider the difference between the peaks in this case?

Vanadium 50 · Oct 19, 2019

kelly0303 said:

Is this right?

Yes, that's what is often done. No, that's not right.

As a practical matter, the systematic has to be large for this to happen: quite a bit larger than the statistical error. What to do is part of the art of science rather than the, um, science of science.

If the differences in means were small, say under one standard deviation, I would be disinclined to call this difference a systematic, lest I double count the statistical error. My preference would be to take the data points from Lab 1 and Lab 2 and treat them together and get a single mean and uncertainty.

If the differences are large, say above three standard deviations, I know I have a systematic not under control. I also know that my uncertainty on the magnitude of this probably is not distributed normally, but hey, I have to do something. The difference between two numbers drawn from a normal distribution averages about 1.13 standard deviations (##2/\sqrt{\pi}##), so I am being a little generous maybe, but it's also from an unknown distribution, so maybe not.
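
The average separation of two draws from the same normal distribution has a closed form, ##2\sigma/\sqrt{\pi} \approx 1.13\sigma##, because the difference of two independent normal variables is itself normal with variance ##2\sigma^2##. The short sketch below compares that value with a simulation; the sample size and seed are arbitrary choices.

```python
import math
import random

# Closed form for E|X1 - X2| with X1, X2 independent standard normal.
expected = 2.0 / math.sqrt(math.pi)

random.seed(0)   # arbitrary seed, only for reproducibility
n = 200_000      # arbitrary number of simulated pairs
simulated = sum(
    abs(random.gauss(0.0, 1.0) - random.gauss(0.0, 1.0)) for _ in range(n)
) / n

print(f"closed form: {expected:.3f}, simulated: {simulated:.3f}")
```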

If they are in-between, I have some thinking to do. It will be strongly influenced by what I know is going on in Lab 1 and Lab 2 and whether I have a reason to suspect a difference. Temperature was mentioned earlier. If one is measured in Ice City and the other in Sun City and the difference corresponds to a couple of degrees, I would be more inclined to assign a systematic than otherwise.

If I have multiple Labs, the distribution matters. Is one Lab an outlier? Do their results seem to form two distributions? Do they form N distributions where N is the number of Labs? I don't think this can be assigned a systematic in the abstract. Again, part of the art of science and not the science of science.

The neutron lifetime discrepancy is a particularly bad example. People tried very hard to report numbers with the proper uncertainties, and they still don't agree. So your question is "what does statistics say we should do when statistics clearly isn't working" and there's not a better answer than "do something else". The Particle Data Group, for example, increases each experiment's uncertainty by a factor of 1.6 before combining, to account for the fact that something clearly has gone wrong, and this is almost certainly the wrong thing to do. It is also, almost as certainly, the least wrong thing to do given our understanding.
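
A rough sketch of the scale-factor idea, under stated assumptions: the uncertainty of the weighted average is inflated by ##S = \sqrt{\chi^2/(N-1)}## whenever ##S > 1##, which is my understanding of the Particle Data Group's procedure and gives the same result as inflating every input uncertainty by ##S## before combining. The thread itself only quotes the factor 1.6. The function name `scaled_average` is made up, and only the two values quoted earlier in the thread are used, so the factor it produces is not expected to match 1.6, which comes from a larger set of experiments.

```python
import math

def scaled_average(values, sigmas):
    """Inverse-variance mean with its uncertainty inflated by
    S = sqrt(chi2 / (N - 1)) when the inputs disagree (S > 1)."""
    weights = [1.0 / s**2 for s in sigmas]
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total
    sigma = 1.0 / math.sqrt(total)
    # Chi-square of the inputs about their weighted mean.
    chi2 = sum(w * (x - mean) ** 2 for w, x in zip(weights, values))
    dof = len(values) - 1
    scale = max(1.0, math.sqrt(chi2 / dof))
    return mean, sigma * scale, scale

# Two neutron-lifetime values quoted in the thread, in seconds.
mean, sigma, scale = scaled_average([888.0, 879.3], [2.1, 0.75])
print(f"{mean:.2f} +/- {sigma:.2f} s (scale factor {scale:.2f})")
```

The central value is unchanged by the scaling; only the quoted uncertainty grows to reflect the disagreement.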

Finally, this is a very, very hard measurement to get right. The top left figure here: http://pdg.lbl.gov/2019/figures/history/figures/history_2018.eps shows the neutron lifetime over time.
