Should I Remove the Outlier Point in My Linear Fit?

In summary, the conversation discusses fitting a straight line to experimental data points and the issue of a potential outlier affecting the slope of the line. The source of error bars on the data points is unclear and there is uncertainty about how to handle the outlier point in the fit. The conversation also mentions that the data was taken from previously published papers and there is no way to change the numbers. The main question is whether or not to include the outlier point in the fit and how to account for it.
  • #1
BillKet
Hello, I have these points coming from different experiments:
##x = [-0.3, -0.2, -0.09, 0.01, 0.2]##
##y = [-8.15, -5.20, -3.32, 0., 5.65]##
##y_{err} = [0.1, 0.27, 0.35, 0.09, 0.44]##
and I need to fit a straight line to them (based on theoretical arguments). I have attached the resulting fit below. For the slope I get ##26.30\pm1.03##. It is quite clear that for the 3rd point something went wrong in the experiment and they didn't account for some systematic effect (it is actually the first point that was measured), as there is no reason for that point to be that far (about 3 sigma) from the fit. But this is the data and I can't really change it. However, I assume that my value of the slope is biased by that point (especially because its error is not that big), and that it probably leads to a bigger error on the estimate of the slope. Do you have any advice on what I should do? Should I just use these values? Should I just remove that point by hand? Should I give it a lower weight in the fit? Thank you!
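For reference, a minimal sketch of a weighted straight-line fit to these numbers. The thread does not say which fitting routine was used; scipy.optimize.curve_fit is assumed here, the x errors are ignored, and the quoted y errors are taken at face value (absolute_sigma=True), so the slope and especially its uncertainty need not reproduce ##26.30\pm1.03## exactly.

```python
import numpy as np
from scipy.optimize import curve_fit

x = np.array([-0.3, -0.2, -0.09, 0.01, 0.2])
y = np.array([-8.15, -5.20, -3.32, 0.0, 5.65])
y_err = np.array([0.1, 0.27, 0.35, 0.09, 0.44])

def line(x, slope, intercept):
    return slope * x + intercept

# Weighted least squares: points with smaller y_err pull harder on the fit.
# absolute_sigma=True keeps y_err as true 1-sigma errors instead of rescaling
# the covariance matrix by the reduced chi-square.
popt, pcov = curve_fit(line, x, y, sigma=y_err, absolute_sigma=True)
slope, intercept = popt
slope_err = np.sqrt(pcov[0, 0])
print(f"slope = {slope:.2f} +/- {slope_err:.2f}")
```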
 

Attachments: fit.png (plot of the data points with the fitted straight line)
  • #2
Before looking at the data, did you suspect the point in question? In retrospect, can you specify a probable data-collection issue (other than a "bad" value)? You do not specify the source of your error bars on the points.
 
  • #3
hutchphd said:
Before looking at the data, did you suspect the point in question? In retrospect, can you specify a probable data-collection issue (other than a "bad" value)? You do not specify the source of your error bars on the points.
Before plotting it, no, I didn't suspect that point was bad. That experiment was done quite a long time ago, so it's difficult to point towards an issue (but most probably it is some systematic error they didn't account for). The error bars come from adding the systematic and statistical errors in quadrature, but I don't have the separate values for the systematic and statistical errors.
 
  • #4
BillKet said:
The error bars come from adding in quadrature the systematic and statistical errors, but I don't have the separate values for systematic and statistical errors.
Sorry, I don't really know what this means. Were there multiple measurements made, etc.? I.e., where did the numbers come from?
 
  • #5
hutchphd said:
Sorry, I don't really know what this means. Were there multiple measurements made, etc.? I.e., where did the numbers come from?
It was just one measurement for each data point. It was a counting experiment, from which they got the statistical error ##\sigma_{stat}##, but there is also some systematic error (having to do with the experimental setup), ##\sigma_{syst}##. The error on the points is ##\sqrt{\sigma_{stat}^2+\sigma_{syst}^2}##.
 
  • #6
I think you just report the values. Why is the error on the fourth point so very small?
 
  • #7
hutchphd said:
I think you just report the values. Why is the error on the fourth point so very small?
What do you mean by "report the values"? Values for what? Also, I am not sure about the size of the error bars. I assume in certain cases the counts were just higher, so the statistical error was lower, and maybe they had better control of their systematics. But my main question is about the 3rd point. Should I take it into account for the fit? And if so, how?
 
  • #8
All data.
Why would the number of counts give error bars on both axes? Please be specific as to what you did to produce the horizontal and vertical bars for, say, point #1.
 
  • #9
hutchphd said:
All data.
Why would the number of counts give error bars on both axes? Please be specific as to what you did to produce the horizontal and vertical bars for, say, point #1.
For the purpose of my question you can ignore the error on the x-axis (you can assume it is zero). The counts are just related to the y-error. Also I am still confused by what you mean by "values" and "all data". What I need is just a number: the value of the slope (and its error). Why would I obtain multiple values?

To produce the errors I just took the values from the papers where they were published. The papers don't give a detailed description of the systematics (at least not in a way that lets you figure out whether some systematics are missing, as is probably the case for the 3rd point). So, to clarify: I take the values of the points and their associated errors from previously published papers, from several experiments. They are not produced by me, so even if they might be wrong, these are the numbers I have to work with. But regardless of how they were produced (which, again, I can't change), all I want to know is the best way to deal with that 3rd point, which is obviously mismeasured. Thank you!
 
  • #10
Sorry, I misunderstood: these are data from previously published experimental results.
That makes it all the more imperative to include every data point. All data (what is unclear about that?). Including point 3. That would be points 1, 2, 3, 4, 5.
(I get the feeling you would like a different answer... and so my attempt at humor.)
 
  • #11
hutchphd said:
Sorry, I misunderstood: these are data from previously published experimental results.
That makes it all the more imperative to include every data point. All data (what is unclear about that?). Including point 3. That would be points 1, 2, 3, 4, 5.
(I get the feeling you would like a different answer... and so my attempt at humor.)
Oh, sorry, I got confused by "report the values". I need to report just one value: the slope. I assume you meant I have to take all the values into account for the fit, right?
 
  • #12
I would.
 
  • #13
There is a standard for when to throw out data, called Chauvenet's criterion. It is likely you would remove the point. However, given the strength of point 4 and the weakness of point 3, I doubt it will make much difference.

Re: 26.30 ± 1.03, you mean 26 ± 1 or possibly 26.3 ± 1.0. 26.30 ± 1.03 means you know the central value to 4% but the error to 1%. I don't think you do.
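To make the criterion concrete, here is a minimal sketch of Chauvenet's criterion applied to the normalized residuals of a weighted straight-line fit, assuming Gaussian errors and the same curve_fit sketch as above. This is an illustration, not the thread's verbatim procedure.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import curve_fit

x = np.array([-0.3, -0.2, -0.09, 0.01, 0.2])
y = np.array([-8.15, -5.20, -3.32, 0.0, 5.65])
y_err = np.array([0.1, 0.27, 0.35, 0.09, 0.44])

def line(x, slope, intercept):
    return slope * x + intercept

popt, _ = curve_fit(line, x, y, sigma=y_err, absolute_sigma=True)

# Normalized residuals: how many sigma each point sits from the fitted line.
z = (y - line(x, *popt)) / y_err

# Chauvenet's criterion: expected number of points deviating at least this
# much (two-sided Gaussian tail probability times N); flag a point if that
# expected count falls below 0.5.
n = len(y)
expected = n * 2.0 * norm.sf(np.abs(z))
for i, (zi, ei) in enumerate(zip(z, expected), start=1):
    verdict = "reject" if ei < 0.5 else "keep"
    print(f"point {i}: {zi:+.2f} sigma, expected count {ei:.2f} -> {verdict}")
```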
 
  • #14
Having educated myself a bit on Chauvenet's principle, I would still choose to keep the data point for such a small data set regardless. It is to some degree an arbitrary decision and the definition of a criterion does not make it standard practice. I have never seen it used.
 
  • #15
Just in case, @BillKet: I guess both x, y are random variables? Or are the x's just points you chose to measure? Can you tell us what the variables x, y measure?
 
  • #16
WWGD said:
Just in case, @BillKet: I guess both x, y are random variables? Or are the x's just points you chose to measure? Can you tell us what the variables x, y measure?
The x-axis represents some fixed property of the measured element. It is a bit complicated to describe, but you can think of it as something associated to the element that you can't modify (such as the atomic number). The y-axis can be thought of as a transition frequency of that measured element. So x's are points we choose to measure.
 
  • #17
I plugged the data into Excel (ignoring the errors) and fitted a linear trendline. With all 5 points it shows a slope of 27.3. Ignoring the third point, it shows a slope of 27.2. Is this a significant difference? How much can you expect from only five points?
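A minimal sketch of the same comparison in Python: an ordinary, unweighted least-squares line via numpy.polyfit, ignoring the error bars, just like the Excel trendline.

```python
import numpy as np

x = np.array([-0.3, -0.2, -0.09, 0.01, 0.2])
y = np.array([-8.15, -5.20, -3.32, 0.0, 5.65])

slope_all, _ = np.polyfit(x, y, 1)               # all five points -> about 27.3
mask = np.arange(len(x)) != 2                     # drop the 3rd point (index 2)
slope_wo3, _ = np.polyfit(x[mask], y[mask], 1)    # remaining four -> about 27.2

print(f"slope with all points:   {slope_all:.1f}")
print(f"slope without point 3:   {slope_wo3:.1f}")
```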
 

1. What is an outlier point in a linear fit?

An outlier point in a linear fit is a data point that does not follow the general trend of the data and is significantly different from the other data points. It can greatly affect the slope and intercept of the linear fit and may indicate a mistake in the data or an unusual occurrence.

2. Why should I consider removing an outlier point in my linear fit?

Removing an outlier point can help improve the accuracy of the linear fit by reducing the influence of extreme values. It can also help to better represent the relationship between the variables being studied.

3. How do I determine if a data point is an outlier in a linear fit?

There are various methods for identifying outliers, such as using statistical measures like the z-score or standard deviation, or visually inspecting a scatter plot. It is important to carefully consider the context of the data and the potential reasons for the outlier before deciding to remove it.

4. What are the potential consequences of removing an outlier point in a linear fit?

Removing an outlier point can significantly change the results of the linear fit, potentially altering the interpretation of the relationship between the variables. It is important to carefully consider the potential impact and whether it is necessary before removing an outlier.

5. Are there any alternatives to removing an outlier point in a linear fit?

Yes, there are alternative methods for dealing with outliers in a linear fit, such as transforming the data or using robust regression techniques. It is important to consult with a statistician or carefully research these methods before deciding on the best approach for your data.
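As one illustration of such a robust alternative (a sketch only; the loss and f_scale choices below are assumptions, not something from the thread): a straight-line fit with a Huber loss via scipy.optimize.least_squares down-weights points with large residuals instead of discarding them outright.

```python
import numpy as np
from scipy.optimize import least_squares

x = np.array([-0.3, -0.2, -0.09, 0.01, 0.2])
y = np.array([-8.15, -5.20, -3.32, 0.0, 5.65])
y_err = np.array([0.1, 0.27, 0.35, 0.09, 0.44])

def residuals(params):
    slope, intercept = params
    return (y - (slope * x + intercept)) / y_err   # error-weighted residuals

# loss="huber" keeps small residuals quadratic but penalizes residuals larger
# than f_scale only linearly, so a single outlier cannot dominate the fit.
result = least_squares(residuals, x0=[25.0, 0.0], loss="huber", f_scale=1.0)
print(f"robust slope: {result.x[0]:.2f}, intercept: {result.x[1]:.2f}")
```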
