Estimating a systematic error

In summary, the conversation discusses estimating systematic error in an experiment where two groups measure the length of a cube using different rulers. After a large number of measurements, two peaks appear in the data, suggesting a systematic error in one of the measurements. However, it is difficult to determine which ruler is causing the error and repeating the experiment is expensive. The conversation explores ways to present the results and determine the best estimate of the cube's length, taking into account the uncertainty caused by the two peaks. Further characterization of the measurement apparatus and careful recalibration may be necessary to accurately determine the length.
  • #1
kelly0303
Hello! I am a bit confused about estimating the systematic error (I think it is systematic) from an experiment. Here is a (simplified) description of it. Assume that 2 groups measure the length of a cube with 2 different rulers, which, due to some effects, give slightly different results (for example, they are made of different materials and have different lengths due to thermal expansion). Assume that they associate the same error ##dx## with each of their measurements (based on the graduations of the ruler) and that each group makes the same number of measurements. After a large number of measurements the data is presented in a histogram and 2 peaks appear, with a similar standard deviation (although this is not that important) but 2 clearly separated means ##\mu_1## and ##\mu_2##. Obviously one of the measurements is not right (or maybe both), so I think that this difference can be considered a systematic error. Now, if I had just one peak, the average length would be the mean of the Gaussian peak and the error would be ##\sigma/\sqrt{N}##, where ##N## is the number of measurements and ##\sigma## their standard deviation. However, now I need to take into account the fact that I don't know which one of the 2 peaks is right, so I need to attach an error associated with that, too. How should I do it properly? Thank you!
 
  • #2
To say anything definitive you need to further characterize that population of rulers. Absent that, they will simply swamp the other errors, making the ##N## measurements more or less irrelevant. (In the absence of further information all ##2N## measurements are equivalent, and the bifurcation will be the predominant error source.) In practice this means either carefully recalibrating the rulers and perhaps discarding the bad data (for cause), or getting good rulers and repeating the experiment.
 
  • #3
hutchphd said:
To say anything definitive you need to further characterize that population of rulers. Absent that, they will simply swamp the other errors, making the ##N## measurements more or less irrelevant. (In the absence of further information all ##2N## measurements are equivalent, and the bifurcation will be the predominant error source.) In practice this means either carefully recalibrating the rulers and perhaps discarding the bad data (for cause), or getting good rulers and repeating the experiment.
My question was more general: assume that repeating the experiment is very expensive, and that it is difficult to figure out which one of the measuring devices (which in a real experiment can be very complex) is doing something wrong. The ruler example was just to make my question clearer, without going into physics details. So, assuming that is the data, that I have no further way to clean it, and that I also know from theoretical calculations that I must have just one peak (or in this case, that the length can't have 2 values), how would I present my result in a paper? Of course in a paper I will explain the situation and why I get these results, but I must also give a best estimate for the mean value of the length and the error associated with it. What is the best way to do so? Thank you!
 
  • #4
The issues here are accuracy (repeatability) and precision (~calibration). Your accuracy exceeds your precision.
Can you characterize the measurement apparatus independently ex post facto somehow? Are the problems likely to be durable and therefore amenable to such analysis? Is it really only two "rulers"? I don't know any magic that passes the "sniff" test without compelling information.
How do you know they were not both doing something wrong? That is my first question for you to answer...
 
  • #5
hutchphd said:
The issues here are accuracy (repeatability) and precision (~calibration). Your accuracy exceeds your precision.
Can you characterize the measurement apparatus independently ex post facto somehow? Are the problems likely to be durable and therefore amenable to such analysis? Is it really only two "rulers"? I don't know any magic that passes the "sniff" test without compelling information.
How do you know they were not both doing something wrong? That is my first question for you to answer...
I am not sure they are not both wrong. This is what I said in the first post, too. They might both still be wrong, but one still needs to give a best estimate of the length, given these measurements; i.e. if you publish a paper, you won't say the length of the cube is, say, ##11 \pm 1## cm and ##14 \pm 1## cm. The cube has just one length, so you have to combine them into just one value, which would obviously have a large error associated with it. My question is: how do you calculate the mean value and the error in this case?

Just to give an actual real-life example (the reason I avoided this is that it might be too complicated and have details that I am not aware of, but here it is): the lifetime of the neutron has 2 significantly different measured values: https://www.quantamagazine.org/neutron-lifetime-puzzle-deepens-but-no-dark-matter-seen-20180213/. One is ##888\pm 2.1## seconds, the other one is ##879.3 \pm 0.75## seconds. No one knows why they are different, so at least one of them is wrong (unless there is some unknown physics at play). Yet if you go on Wikipedia you will see that the mean lifetime of the neutron is given as ##881.5(15)## seconds. So the two measurements were combined into a single value, even though they are not consistent with one another. Again, I am not sure about all the details of these experiments, so my example might not be very useful, but what I am trying to say is that running an experiment measuring something and getting 2 very different values is not at all uncommon (same thing with the proton radius, but that might be almost solved). Also, you have no idea why they are different, so if you want to give just one value, you have to combine the 2 measurements. My question is: what is the best way to do that?
 
  • #6
kelly0303 said:
Hello! I am a bit confused about estimating the systematic error (I think it is systematic) from an experiment. Here is a (simplified) description of it. Assume that 2 groups measure the length of a cube with 2 different rulers, which, due to some effects, give slightly different results (for example, they are made of different materials and have different lengths due to thermal expansion). Assume that they associate the same error ##dx## with each of their measurements (based on the graduations of the ruler) and that each group makes the same number of measurements. After a large number of measurements the data is presented in a histogram and 2 peaks appear, with a similar standard deviation (although this is not that important) but 2 clearly separated means ##\mu_1## and ##\mu_2##. Obviously one of the measurements is not right (or maybe both), so I think that this difference can be considered a systematic error. Now, if I had just one peak, the average length would be the mean of the Gaussian peak and the error would be ##\sigma/\sqrt{N}##, where ##N## is the number of measurements and ##\sigma## their standard deviation. However, now I need to take into account the fact that I don't know which one of the 2 peaks is right, so I need to attach an error associated with that, too. How should I do it properly? Thank you!
The best reference for handling uncertainty is this guide from NIST:

https://www.nist.gov/sites/default/files/documents/2017/05/09/tn1297s.pdf

See sections 2.2 and 2.3 first. Uncertainty is no longer classified as random or systematic. It is now classified as uncertainty that is evaluated by statistical methods or uncertainty that is evaluated by other means. In this case, although your uncertainty would have formerly been classified as systematic, you are evaluating it by statistical methods.

Section 3 gives some scanty advice for dealing with this type of uncertainty. It suggests using ANOVA, which seems reasonable to me.
 
  • #7
If the uncertainties in each reported number are simply considered RMS sampling errors (call them ##\sigma_1## and ##\sigma_2##), then it makes sense to weight each mean value by ##1/\sigma^2##. (The easy way to think about this is that each ##\sigma## is proportional to ##1/\sqrt{N}##, where ##N## is the number of measurements in each test.) This gives the Wikipedia mean value, I believe, and probably makes sense if you insist upon combining values.
How they get the uncertainty I don't know (frankly, I never know for certain what that notation even means!). The number ##\pm 1.5## seems grotesquely wrong to me, as does ##\pm 15##.
Were I required to report a value, I would choose the smaller value with its uncertainty and footnote the possible discrepancy. The longer value is more likely wrong, in my estimation, because failure to see decay events will produce a larger lifetime, and this is more likely than seeing erroneous events, I think. But there is no clean prescription.
 
  • #8
The fact that you get two different estimates with two instruments alerts you to a problem with your measurements. Both must be suspect to start. Instruments are usually accompanied by an accuracy spec from the manufacturer based on some sort of calibration procedure. Ideally that spec should be verified by the user.

In your case you should obtain a certified (standard) length from a trusted agency and compare your rulers to it. Any discrepancy can then be compensated for. Such a standard will also have an uncertainty associated with it. This uncertainty should be considered systematic. I would add it to the statistical uncertainty. Ultimately you have to take into account all factors that contribute to the final value; for example, both rulers could be fine but give different values depending on the temperature of the object being measured.

My philosophy is that one probably knows the accuracy of his measurements less well than he thinks.
 
  • #9
gleem said:
In your case you should obtain a certified (standard) length from a trusted agency and compare your rulers to it.
I think you missed #5 from OP. I was unaware of the discrepancy and it makes the analysis less routine...
 
  • #10
hutchphd said:
I think you missed #5 from OP. I was unaware of the discrepancy and it makes the analysis less routine...

No, I read it before. I still don't see your point.

The neutron discrepancy is between two labs. The OP's problem is internal and, if it were in my lab, would have to be resolved. I do not see any "good" way to combine the two measurements, which are three SD apart and made with two different instruments, unless you combined all the measurements assuming the two instruments are equivalent, calculated a mean and SD from that, and swept the issue under the rug. But I don't think that is honest.

Another approach is to let an unbiased person (or persons) look at the data and the data-collection process and evaluate it. Maybe they can see the problem.
 
  • #11
I agree.
 
  • #12
gleem said:
Such a standard will also have an uncertainty associated with it. This uncertainty should be considered systematic.
Again, uncertainty is no longer classified this way, but you are correct that it is an uncertainty that is evaluated by other methods.

gleem said:
I would add it to the statistical uncertainty.
I agree. Specifically, add the variances, not the standard deviations.
 
  • #13
gleem said:
No, I read it before. I still don't see your point.

The neutron discrepancy is between two labs. The OP's problem is internal and, if it were in my lab, would have to be resolved. I do not see any "good" way to combine the two measurements, which are three SD apart and made with two different instruments, unless you combined all the measurements assuming the two instruments are equivalent, calculated a mean and SD from that, and swept the issue under the rug. But I don't think that is honest.

Another approach is to let an unbiased person (or persons) look at the data and the data-collection process and evaluate it. Maybe they can see the problem.
I never said the measurements are in the same lab. When I said 2 groups I meant it in general: they can be from the same lab or not (that is not really relevant to what I want to know). But to make it clear (and more in line with the neutron lifetime example), say that 2 papers publish the same measurement (made in 2 different labs) and give 2 different values that differ by several sigma. How would you combine the 2 numbers into one number with an associated error (assume you want to make a post on Wikipedia, for example)? Or even simpler, how did they calculate a single value on Wikipedia from the 2 different measurements of the neutron lifetime? Understanding how they did it might help answer my question automatically.
 
  • #14
Dale said:
I agree. Specifically, add the variances, not the standard deviations.
Could you elaborate a bit on this? I am not sure I understand what you mean.
 
  • #15
kelly0303 said:
Could you elaborate a bit on this? I am not sure I understand what you mean.
If you have a statistically determined uncertainty ##\sigma_1## and an uncertainty ##\sigma_2## determined by other methods, then the combined uncertainty is ##\sigma_c^2=\sigma_1^2+\sigma_2^2## and not ##\sigma_c=\sigma_1+\sigma_2##
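A minimal numeric sketch of that quadrature rule (the numbers below are hypothetical, purely for illustration):

```python
import math

# Hypothetical uncertainties, purely illustrative:
sigma_stat = 0.3   # statistically determined (e.g. sigma/sqrt(N))
sigma_other = 0.4  # evaluated by other means (e.g. calibration)

# Combine by adding the variances, then taking the square root
sigma_c = math.sqrt(sigma_stat**2 + sigma_other**2)
print(sigma_c)  # 0.5, rather than the 0.7 that naive addition would give
```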
 
  • #16
Dale said:
If you have a statistically determined uncertainty ##\sigma_1## and an uncertainty ##\sigma_2## determined by other methods, then the combined uncertainty is ##\sigma_c^2=\sigma_1^2+\sigma_2^2## and not ##\sigma_c=\sigma_1+\sigma_2##
Oh, sure! My problem in this case is that I am not sure what to use for each of the two. To give some numerical examples: ##\mu_1 = 50##, ##\mu_2 = 60##, ##\sigma_1 = 3##, ##\sigma_2 = 4## (##\sigma_1## and ##\sigma_2## are not necessarily the ones in your formula). What mean value and error on the mean would you use based on these values?
 
  • #17
Since you are treating both of these statistically, the approach the NIST guide recommends is ANOVA. That will give you a mean and a standard error on the mean. Since the standard deviations are different, you would want to apply a correction for heteroskedasticity. Most software packages will have one, but you will need to check the documentation.
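To make this concrete, here is a minimal sketch with simulated data (all numbers invented). Note that `scipy.stats.f_oneway` is a plain one-way ANOVA and does not apply the heteroskedasticity correction mentioned above; it only illustrates flagging a difference between the group means:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical readings from two groups whose instruments disagree
group1 = rng.normal(loc=50.0, scale=3.0, size=100)
group2 = rng.normal(loc=60.0, scale=4.0, size=100)

# One-way ANOVA: is the between-group variation larger than expected
# from the within-group scatter? A tiny p-value flags the offset.
f_stat, p_value = stats.f_oneway(group1, group2)
print(f"F = {f_stat:.1f}, p = {p_value:.2g}")

# Pooling everything gives a single mean and standard error, but the
# standard error alone understates the bimodal spread (see later posts)
pooled = np.concatenate([group1, group2])
sem = pooled.std(ddof=1) / np.sqrt(pooled.size)
print(f"pooled mean = {pooled.mean():.2f} +/- {sem:.2f}")
```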
 
  • #18
Here is what people do. For simplicity, I am assuming both peaks have the same number of entries and the same width; if that's not the case you can weight.

The estimator is the average of the peaks' positions. The uncertainty on the mean is the uncertainty on a single peak added in quadrature with the distance between the peaks.

This slightly overestimates the error, but better that than the reverse.
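A minimal numeric sketch of this prescription, using hypothetical peak parameters (equal entries and widths, as assumed above):

```python
import math

# Hypothetical fitted peak parameters
mu1, mu2 = 50.0, 60.0   # peak positions
sigma_peak = 3.0        # common peak width
N = 100                 # entries per peak

# Uncertainty on a single peak's mean
sigma_single = sigma_peak / math.sqrt(N)

# Estimator: the average of the peak positions
estimate = 0.5 * (mu1 + mu2)

# Uncertainty: single-peak uncertainty added in quadrature with the
# distance between the peaks
uncertainty = math.sqrt(sigma_single**2 + (mu2 - mu1)**2)

print(f"{estimate} +/- {uncertainty:.1f}")  # 55.0 +/- 10.0
```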
 
  • #19
kelly0303 said:
I never said the measurements are in the same lab. When I said 2 groups I meant it in general: they can be from the same lab or not (that is not really relevant to what I want to know). But to make it clear (and more in line with the neutron lifetime example), say that 2 papers publish the same measurement (made in 2 different labs) and give 2 different values that differ by several sigma. How would you combine the 2 numbers into one number with an associated error (assume you want to make a post on Wikipedia, for example)? Or even simpler, how did they calculate a single value on Wikipedia from the 2 different measurements of the neutron lifetime? Understanding how they did it might help answer my question automatically.

Your OP made no sense to me unless it was in the same lab and for the same experiment.

Your neutron lifetime example is poor since the two experimental results are from completely different experiments. Differing results may be disconcerting. As I stated above, when two measurements of the same quantity do not agree, it "alerts" you to possible issues with one or both experiments. In the case of the neutron lifetime measurements, something overlooked might be going on, and isn't that the current controversy? Look at the measurements for each experimental technique. The beam measurements have a larger uncertainty and the technique is probably difficult, which I can appreciate. They measured the decay of the neutrons by measuring how fast protons + electrons are being produced. The trapped-neutron measurement had a significantly smaller uncertainty but measured the number of undecayed neutrons, i.e. how fast they are disappearing. I do not know how valid it is to average them. However, it does appear to me that they may have taken the reciprocal-uncertainty weighted average of the two, thus weighting the more accurate experiment more heavily.

##(X_1/\sigma_1 + X_2/\sigma_2)/(1/\sigma_1 + 1/\sigma_2)##

So in your case, are you talking about experiments measuring the same thing the same way, or, as in the neutron example, using completely different techniques?
 
  • #20
kelly0303 said:
I never said the measurements are in the same lab. When I said 2 groups I meant it in general: they can be from the same lab or not (that is not really relevant to what I want to know). But to make it clear (and more in line with the neutron lifetime example), say that 2 papers publish the same measurement (made in 2 different labs) and give 2 different values that differ by several sigma. How would you combine the 2 numbers into one number with an associated error (assume you want to make a post on Wikipedia, for example)? Or even simpler, how did they calculate a single value on Wikipedia from the 2 different measurements of the neutron lifetime? Understanding how they did it might help answer my question automatically.
Are you measuring the (literally) same object, e.g., measuring it in one lab and shipping it to the second, or are you measuring the same line of product, e.g., acme product #213-x1, in both labs separately?
 
  • #21
gleem said:
However, it does appear to me that they may have taken the reciprocal-uncertainty weighted average of the two, thus weighting the more accurate experiment more heavily.

##(X_1/\sigma_1 + X_2/\sigma_2)/(1/\sigma_1 + 1/\sigma_2)##
As mentioned previously, they weighted them inversely according to the variance (which is the square of the SD). This is the "correct" way to do such a weighting, as can be seen by arbitrarily dividing a homogeneous sample into sub-populations of different size.
Whether it is appropriate here is dubious, but it is correct in the simple case.
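As a concrete sketch, here is inverse-variance weighting applied to the two neutron lifetime values quoted earlier in the thread. This is the naive combination that ignores how inconsistent the two inputs are; see post #26 below for why groups like the PDG inflate the uncertainties before combining.

```python
import numpy as np

# The two lifetime values quoted earlier in the thread (seconds)
x = np.array([888.0, 879.3])
sigma = np.array([2.1, 0.75])

# Inverse-variance weights: 1/sigma^2, not 1/sigma
w = 1.0 / sigma**2

mean = np.sum(w * x) / np.sum(w)
err = 1.0 / np.sqrt(np.sum(w))  # uncertainty of the weighted mean

print(f"{mean:.1f} +/- {err:.2f} s")  # about 880.3 +/- 0.71 s
```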
 
  • #22
WWGD said:
Are you measuring the (literally) same object, e.g., measuring it in one lab and shipping it to the second, or are you measuring the same line of product, e.g., acme product #213-x1, in both labs separately?
I don't have an actual experiment. I am just curious, for my personal knowledge. Again, using the neutron lifetime, you can assume they are literally the same object (i.e. a neutron), but measured in two different labs.
 
  • #23
gleem said:
Your OP made no sense to me unless it was in the same lab and for the same experiment.

Your neutron lifetime example is poor since the two experimental results are from completely different experiments. Differing results may be disconcerting. As I stated above, when two measurements of the same quantity do not agree, it "alerts" you to possible issues with one or both experiments. In the case of the neutron lifetime measurements, something overlooked might be going on, and isn't that the current controversy? Look at the measurements for each experimental technique. The beam measurements have a larger uncertainty and the technique is probably difficult, which I can appreciate. They measured the decay of the neutrons by measuring how fast protons + electrons are being produced. The trapped-neutron measurement had a significantly smaller uncertainty but measured the number of undecayed neutrons, i.e. how fast they are disappearing. I do not know how valid it is to average them. However, it does appear to me that they may have taken the reciprocal-uncertainty weighted average of the two, thus weighting the more accurate experiment more heavily.

##(X_1/\sigma_1 + X_2/\sigma_2)/(1/\sigma_1 + 1/\sigma_2)##

So in your case, are you talking about experiments measuring the same thing the same way, or, as in the neutron example, using completely different techniques?
I don't have an actual experiment. I am just curious based on things I read online about experiments giving different results. I gave the ruler example just for simplicity. So in the formula for the mean you gave, I assume you meant to use the variance, not the standard deviation, right? However, my main issue is how to calculate the error associated with this mean (again, assume the neutron lifetime case, in order to have a clear example).
 
  • #24
Dale said:
Since you are treating both of these statistically, the approach the NIST guide recommends is ANOVA. That will give you a mean and a standard error on the mean. Since the standard deviations are different, you would want to apply a correction for heteroskedasticity. Most software packages will have one, but you will need to check the documentation.
Thank you for this! I am quite new to statistics in general, but isn't ANOVA a way to check if the two means are statistically different? Does it actually give you the overall mean and error of these measurements?
 
  • #25
Vanadium 50 said:
Here is what people do. For simplicity, I am assuming both peaks have the same number of entries and the same width; if that's not the case you can weight.

The estimator is the average of the peaks' positions. The uncertainty on the mean is the uncertainty on a single peak added in quadrature with the distance between the peaks.

This slightly overestimates the error, but better that than the reverse.
Thanks a lot for this! I am pretty sure this is what I was looking for! So the error would be ##\sqrt{\sigma_1^2+(\mu_1-\mu_2)^2}##. Is this right? One more question: assuming there are more than 2 peaks (again, same number of events and same width), how would you account for the differences between the peaks in this case?
 
  • #26
kelly0303 said:
Is this right?

Yes, that's what is often done. No, that's not right. 😉

As a practical matter, the systematic has to be large for this to happen: quite a bit larger than the statistical error. What to do is part of the art of science rather than the, um, science of science.

If the differences in means were small, say under one standard deviation, I would be disinclined to call this difference a systematic, lest I double count the statistical error. My preference would be to take the data points from Lab 1 and Lab 2 and treat them together and get a single mean and uncertainty.

If the differences are large, say above three standard deviations, I know I have a systematic not under control. I also know that my uncertainty on the magnitude of this probably is not distributed normally, but hey, I have to do something. The difference between two numbers drawn from a normal distribution averages about 1.13 standard deviations (##2/\sqrt{\pi}##), so I am being a little generous maybe, but it's also from an unknown distribution, so maybe not.
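A quick Monte Carlo sketch of that last figure, purely for illustration (the analytic expectation for two independent draws from the same normal distribution is ##2\sigma/\sqrt{\pi} \approx 1.13\sigma##):

```python
import numpy as np

rng = np.random.default_rng(0)

# Many pairs drawn from a unit normal; average absolute gap between them
pairs = rng.normal(size=(1_000_000, 2))
mean_gap = np.abs(pairs[:, 0] - pairs[:, 1]).mean()

print(mean_gap)            # ~1.128
print(2 / np.sqrt(np.pi))  # analytic value, ~1.128
```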

If they are in-between, I have some thinking to do. It will be strongly influenced by what I know is going on in Lab 1 and Lab 2 and whether I have a reason to suspect a difference. Temperature was mentioned earlier: if one is measured in Ice City and the other in Sun City and the difference corresponds to a couple of degrees, I would be more inclined to assign a systematic than otherwise.

If I have multiple Labs, the distribution matters. Is one Lab an outlier? Do their results seem to form two distributions? Do they form N distributions where N is the number of Labs? I don't think this can be assigned a systematic in the abstract. Again, part of the art of science and not the science of science.

The neutron lifetime discrepancy is a particularly bad example. People tried very hard to report numbers with the proper uncertainties, and they still don't agree. So your question is "what does statistics say we should do when statistics clearly isn't working" and there's not a better answer than "do something else". The Particle Data Group, for example, increases each experiment's uncertainty by a factor of 1.6 before combining, to account for the fact that something clearly has gone wrong, and this is almost certainly the wrong thing to do. It is also, almost as certainly, the least wrong thing to do given our understanding.

Finally, this is a very, very hard measurement to get right. The top left figure here: http://pdg.lbl.gov/2019/figures/history/figures/history_2018.eps shows the neutron lifetime over time.
 

FAQ: Estimating a systematic error

1. What is a systematic error?

A systematic error is a type of error that occurs consistently in the same direction, leading to a deviation from the true value. It is caused by flaws in the experimental design or equipment, and can result in inaccurate measurements or data.

2. How is a systematic error different from a random error?

A random error is a type of error that occurs randomly and is not consistent, leading to a spread of values around the true value. Unlike systematic errors, random errors can be reduced by taking multiple measurements and calculating an average.

3. How can systematic errors be identified and corrected?

Systematic errors can be identified by comparing the results from different methods or equipment, or by repeating the experiment with different settings. To correct for systematic errors, adjustments can be made to the experimental design or equipment, or a correction factor can be applied to the data.

4. What are some examples of systematic errors?

Some examples of systematic errors include zero error in measuring instruments, faulty calibration of equipment, and environmental factors such as temperature or humidity affecting the experiment. Human error, such as misreading a scale or recording data incorrectly, can also lead to systematic errors.

5. How can the impact of a systematic error be minimized?

To minimize the impact of a systematic error, it is important to identify and correct for it as early as possible. This can be done by carefully designing the experiment, calibrating equipment regularly, and taking multiple measurements. It is also important to document any potential sources of systematic error and their effects on the results.
