Should I Remove the Outlier Point in My Linear Fit?

In summary, the conversation discusses fitting a straight line to experimental data points and the issue of a potential outlier affecting the slope of the line. The source of error bars on the data points is unclear and there is uncertainty about how to handle the outlier point in the fit. The conversation also mentions that the data was taken from previously published papers and there is no way to change the numbers. The main question is whether or not to include the outlier point in the fit and how to account for it.
  • #1
BillKet
Hello, I have these points coming from different experiments:
##x = [-0.3, -0.2, -0.09, 0.01, 0.2]##
##y = [-8.15, -5.20, -3.32, 0., 5.65]##
##y_{err} = [0.1, 0.27, 0.35, 0.09, 0.44]##
and I need to fit a straight line to them (based on theoretical arguments). I have attached the resulting fit below. For the slope I get ##26.30\pm1.03##. It is quite clear that for the 3rd point something went wrong in the experiment and they didn't account for some systematic effect (it is actually the first point that was measured), as there is no reason for that point to be that far (about 3 sigma) from the fit. But this is the data and I can't really change it. However, I assume that my value of the slope is biased by that point (especially because its error is not that big), and that it probably leads to a bigger error on the estimate of the slope. Do you have any advice on what I should do? Should I just use these values? Should I just remove that point by hand? Should I give it a lower weight in the fit? Thank you!
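For reference, a minimal sketch of a weighted straight-line fit to these numbers. The thread does not say which fitting routine was used; scipy.optimize.curve_fit is assumed here, the x errors are ignored, and the quoted y errors are taken at face value (absolute_sigma=True), so the slope and especially its uncertainty need not reproduce ##26.30\pm1.03## exactly.

```python
import numpy as np
from scipy.optimize import curve_fit

x = np.array([-0.3, -0.2, -0.09, 0.01, 0.2])
y = np.array([-8.15, -5.20, -3.32, 0.0, 5.65])
y_err = np.array([0.1, 0.27, 0.35, 0.09, 0.44])

def line(x, slope, intercept):
    return slope * x + intercept

# Weighted least squares: points with smaller y_err pull harder on the fit.
# absolute_sigma=True keeps y_err as true 1-sigma errors instead of rescaling
# the covariance matrix by the reduced chi-square.
popt, pcov = curve_fit(line, x, y, sigma=y_err, absolute_sigma=True)
slope, intercept = popt
slope_err = np.sqrt(pcov[0, 0])
print(f"slope = {slope:.2f} +/- {slope_err:.2f}")
```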
 

Attachments: fit.png (plot of the data points with the fitted straight line)
  • #2
Before looking at the data, did you suspect the point in question? In retrospect, can you specify a probable data-collection issue (other than a "bad" value)? You do not specify the source of your error bars on the points.
 
  • #3
hutchphd said:
Before looking at the data, did you suspect the point in question? In retrospect, can you specify a probable data-collection issue (other than a "bad" value)? You do not specify the source of your error bars on the points.
Before plotting it, no, I didn't suspect that point was bad. That experiment was done quite a long time ago, so it's difficult to point towards an issue (but most probably it is some systematic error they didn't account for). The error bars come from adding the systematic and statistical errors in quadrature, but I don't have the separate values for the systematic and statistical errors.
 
  • #4
BillKet said:
The error bars come from adding in quadrature the systematic and statistical errors, but I don't have the separate values for systematic and statistical errors.
Sorry, I don't really know what this means. Were there multiple measurements made, etc.? I.e., where did the numbers come from?
 
  • #5
hutchphd said:
Sorry, I don't really know what this means. Were there multiple measurements made, etc.? I.e., where did the numbers come from?
It was just one measurement for each data point. It was a counting experiment, from which they got the statistical error ##\sigma_{stat}##, but there is also some systematic error (having to do with the experimental setup), ##\sigma_{syst}##. The error on the points is ##\sqrt{\sigma_{stat}^2+\sigma_{syst}^2}##.
 
  • #6
I think you just report the values. Why is the error on the fourth point so very small?
 
  • #7
hutchphd said:
I think you just report the values. Why is the error on the fourth point so very small?
What do you mean by "report the values"? Values for what? Also, I am not sure about the size of the error bars. I assume in certain cases the counts were just higher, so the statistical error was lower, and maybe they had better control of their systematics. But my main question is about the 3rd point. Should I take it into account for the fit? And if so, how?
 
  • #8
All data.
Why would the number of counts give error bars on both axes? Please be specific as to what you did to produce the horizontal and vertical bars for, say, point #1.
 
  • #9
hutchphd said:
All data.
Why would the number of counts give error bars on both axes? Please be specific as to what you did to produce the horizontal and vertical bars for, say, point #1.
For the purpose of my question you can ignore the error on the x-axis (you can assume it is zero). The counts are just related to the y-error. Also I am still confused by what you mean by "values" and "all data". What I need is just a number: the value of the slope (and its error). Why would I obtain multiple values?

To produce the errors I just took the values from the papers where they were published. The papers don't give a detailed description of the systematics (at least not in a way that lets you figure out whether some systematics are missing, as is probably the case for the 3rd point). So, to clarify: I take the values of the points and their associated errors from previously published papers, from several experiments. They are not produced by me, so even if they might be wrong, these are the numbers I have to work with. But regardless of how they were produced (which, again, I can't change), all I want to know is the best way to deal with that 3rd point, which is obviously mismeasured. Thank you!
 
  • #10
Sorry, I misunderstood: these are data from previously published experimental results.
That makes it all the more imperative to include every data point. All data (what is unclear about that?). Including point 3. That would be points 1, 2, 3, 4, 5.
(I get the feeling you would like a different answer... and so my attempt at humor.)
 
  • #11
hutchphd said:
Sorry, I misunderstood: these are data from previously published experimental results.
That makes it all the more imperative to include every data point. All data (what is unclear about that?). Including point 3. That would be points 1, 2, 3, 4, 5.
(I get the feeling you would like a different answer... and so my attempt at humor.)
Oh, sorry, I got confused by "report the values". I need to report just one value: the slope. I assume you meant I have to take all the values into account for the fit, right?
 
  • #12
I would.
 
  • #13
There is a standard for when to throw out data, called Chauvenet's criterion. It is likely you would remove the point. However, given the strength of point 4 and the weakness of point 3, I doubt it will make much difference.

Re: 26.30 ± 1.03, you mean 26 ± 1 or possibly 26.3 ± 1.0. 26.30 ± 1.03 means you know the central value to 4% but the error to 1%. I don't think you do.
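To make the criterion concrete, here is a minimal sketch of Chauvenet's criterion applied to the normalized residuals of a weighted straight-line fit, assuming Gaussian errors and the same curve_fit sketch as above. This is an illustration, not the thread's verbatim procedure.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import curve_fit

x = np.array([-0.3, -0.2, -0.09, 0.01, 0.2])
y = np.array([-8.15, -5.20, -3.32, 0.0, 5.65])
y_err = np.array([0.1, 0.27, 0.35, 0.09, 0.44])

def line(x, slope, intercept):
    return slope * x + intercept

popt, _ = curve_fit(line, x, y, sigma=y_err, absolute_sigma=True)

# Normalized residuals: how many sigma each point sits from the fitted line.
z = (y - line(x, *popt)) / y_err

# Chauvenet's criterion: expected number of points deviating at least this
# much (two-sided Gaussian tail probability times N); flag a point if that
# expected count falls below 0.5.
n = len(y)
expected = n * 2.0 * norm.sf(np.abs(z))
for i, (zi, ei) in enumerate(zip(z, expected), start=1):
    verdict = "reject" if ei < 0.5 else "keep"
    print(f"point {i}: {zi:+.2f} sigma, expected count {ei:.2f} -> {verdict}")
```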
 
  • #14
Having educated myself a bit on Chauvenet's principle, I would still choose to keep the data point for such a small data set regardless. It is to some degree an arbitrary decision and the definition of a criterion does not make it standard practice. I have never seen it used.
 
  • #15
Just in case, @BillKet: I guess both x, y are random variables? Or are the x's just points you chose to measure? Can you tell us what the variables x, y measure?
 
  • #16
WWGD said:
Just in case, @BillKet: I guess both x, y are random variables? Or are the x's just points you chose to measure? Can you tell us what the variables x, y measure?
The x-axis represents some fixed property of the measured element. It is a bit complicated to describe, but you can think of it as something associated to the element that you can't modify (such as the atomic number). The y-axis can be thought of as a transition frequency of that measured element. So x's are points we choose to measure.
 
  • #17
I plugged the data into Excel (ignoring the errors) and fitted a linear trendline. With all 5 points it shows a slope of 27.3. Ignoring the third point, it shows a slope of 27.2. Is this a significant difference? How much can you expect from only five points?
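A minimal sketch of the same comparison in Python: an ordinary, unweighted least-squares line via numpy.polyfit, ignoring the error bars, just like the Excel trendline.

```python
import numpy as np

x = np.array([-0.3, -0.2, -0.09, 0.01, 0.2])
y = np.array([-8.15, -5.20, -3.32, 0.0, 5.65])

slope_all, _ = np.polyfit(x, y, 1)               # all five points -> about 27.3
mask = np.arange(len(x)) != 2                     # drop the 3rd point (index 2)
slope_wo3, _ = np.polyfit(x[mask], y[mask], 1)    # remaining four -> about 27.2

print(f"slope with all points:   {slope_all:.1f}")
print(f"slope without point 3:   {slope_wo3:.1f}")
```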
 

1. What is an outlier point in a linear fit?

An outlier point in a linear fit is a data point that does not follow the general trend of the data and is significantly different from the other data points. It can greatly affect the slope and intercept of the linear fit and may indicate a mistake in the data or an unusual occurrence.

2. Why should I consider removing an outlier point in my linear fit?

Removing an outlier point can help improve the accuracy of the linear fit by reducing the influence of extreme values. It can also help to better represent the relationship between the variables being studied.

3. How do I determine if a data point is an outlier in a linear fit?

There are various methods for identifying outliers, such as using statistical measures like the z-score or standard deviation, or visually inspecting a scatter plot. It is important to carefully consider the context of the data and the potential reasons for the outlier before deciding to remove it.

4. What are the potential consequences of removing an outlier point in a linear fit?

Removing an outlier point can significantly change the results of the linear fit, potentially altering the interpretation of the relationship between the variables. It is important to carefully consider the potential impact and whether it is necessary before removing an outlier.

5. Are there any alternatives to removing an outlier point in a linear fit?

Yes, there are alternative methods for dealing with outliers in a linear fit, such as transforming the data or using robust regression techniques. It is important to consult with a statistician or carefully research these methods before deciding on the best approach for your data.
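As one illustration of such a robust alternative (a sketch only; the loss and f_scale choices below are assumptions, not something from the thread): a straight-line fit with a Huber loss via scipy.optimize.least_squares down-weights points with large residuals instead of discarding them outright.

```python
import numpy as np
from scipy.optimize import least_squares

x = np.array([-0.3, -0.2, -0.09, 0.01, 0.2])
y = np.array([-8.15, -5.20, -3.32, 0.0, 5.65])
y_err = np.array([0.1, 0.27, 0.35, 0.09, 0.44])

def residuals(params):
    slope, intercept = params
    return (y - (slope * x + intercept)) / y_err   # error-weighted residuals

# loss="huber" keeps small residuals quadratic but penalizes residuals larger
# than f_scale only linearly, so a single outlier cannot dominate the fit.
result = least_squares(residuals, x0=[25.0, 0.0], loss="huber", f_scale=1.0)
print(f"robust slope: {result.x[0]:.2f}, intercept: {result.x[1]:.2f}")
```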
