Mathematical impact of outliers on accuracy of models


Discussion Overview

The discussion centers on the impact of outliers on the accuracy of predictive models, particularly in the context of statistical analysis and regression techniques. Participants explore various methods for identifying and handling outliers, as well as the implications of these methods on model performance.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • Some participants suggest that the impact of outliers is model-dependent, particularly in regression models based on normally distributed data.
  • There is a proposal to rerun models with and without outliers to assess their influence, though this approach may not always be valid.
  • One participant emphasizes the need for heuristics to identify and delete gross outliers before refining the model, noting that this method can lead to issues if the heuristics are incorrect.
  • Another participant raises concerns about the validity of simply deleting outliers without understanding their cause, highlighting the challenges of identifying outliers in high-dimensional data.
  • Robust statistical analysis methods are mentioned as an alternative to traditional methods, aiming to reduce the influence of outliers while still providing interpretable results.
  • A question is raised about determining the probability of a high-leverage point being influential, with a response indicating that while all points are influential, not all have equal impact.

Areas of Agreement / Disagreement

Participants generally agree that the treatment of outliers is complex and context-dependent, with multiple competing views on the best approaches to analyze and manage them. The discussion remains unresolved regarding the most effective methods for handling outliers in predictive modeling.

Contextual Notes

Limitations include the dependence on model assumptions, the difficulty of identifying outliers in high-dimensional spaces, and the potential for misleading conclusions when outliers are removed without proper justification.

Galteeth
Is there a general approach to calculating the impact outliers have on the accuracy of one's (predictive) model?
 
Galteeth said:
Is there a general approach to calculating the impact outliers have on the accuracy of one's (predictive) model?

I think your question is too general. You need to describe your model.

If your model is based on regression against approximately normally distributed data, the influence of outliers is well understood.

I have seen data analysts rerun the model with and without the outlier points.
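The rerun-with-and-without comparison can be sketched in a few lines. This is a minimal illustration with made-up data (the injected outlier, seeds, and thresholds are all invented for the example), not a general procedure:

```python
import numpy as np

# Hypothetical data: a linear trend (slope 2) with one gross outlier at the end.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)
y[-1] += 40.0  # inject an outlier at a high-leverage x value

# Fit with and without the suspect point, then compare the slopes.
slope_with, intercept_with = np.polyfit(x, y, 1)
slope_without, intercept_without = np.polyfit(x[:-1], y[:-1], 1)

print(f"slope with outlier:    {slope_with:.3f}")
print(f"slope without outlier: {slope_without:.3f}")
```

A large shift between the two fits flags the point as influential; a negligible shift suggests the fit is not sensitive to it.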
 
As wofsy said, the question is far too general. There is no one technique for analyzing what outliers will do or have done. Editing outliers is one commonly used technique: sensors do occasionally go out to lunch, and transmission errors can create huge outliers. Just about the only thing one can do with a 10^20-sigma outlier is to delete it. This suggests a refinement of the approach wofsy described in his post: use some heuristic to delete gross outliers, run the model, delete the statistical outliers that the gross heuristics didn't catch, and re-run the model.

This doesn't always work because it assumes that the heuristics and model are basically correct. Example: The ozone hole over Antarctica was initially discovered by ground observations rather than by satellite observations because of the overaggressive use of this technique on the satellite data.
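The two-pass scheme above (gross heuristic first, statistical cutoff second) can be sketched like this. The function name, thresholds, and data are all made up for illustration, and the Antarctica example is exactly the failure mode: if the heuristic or the provisional fit is wrong, real signal gets deleted too.

```python
import numpy as np

def two_pass_clean(x, y, gross_limit=1e3, z_cutoff=3.0):
    """Illustrative two-pass cleaning (names and thresholds invented):
    1. drop gross outliers by a crude absolute-value heuristic;
    2. fit, drop points with large standardized residuals, refit."""
    keep = np.abs(y) < gross_limit           # pass 1: gross heuristic
    x1, y1 = x[keep], y[keep]
    b, a = np.polyfit(x1, y1, 1)             # provisional fit
    resid = y1 - (a + b * x1)
    z = (resid - resid.mean()) / resid.std()
    keep2 = np.abs(z) < z_cutoff             # pass 2: statistical outliers
    return np.polyfit(x1[keep2], y1[keep2], 1)

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = 3.0 * x + rng.normal(0, 1.0, size=x.size)
y[10] = 1e6      # transmission-error-style gross outlier
y[50] += 8.0     # milder statistical outlier
slope, intercept = two_pass_clean(x, y)
```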
 
The question and answers are dancing around the topic of robust statistical analysis methods. Deleting outliers is one way to deal with them, but unless you know that they are due to errors in measurement (sensors going haywire) eliminating them simply because they are outliers is not a valid statistical procedure.
It is also important to note these things:

  • Outliers are rather easy to find in low-dimensional problems, but extremely difficult in high-dimensional problems.
  • In regression, points of high leverage may not appear as outliers in the traditional sense of large residuals; in severe situations the regression line may pass through them, so the residual is zero.

The point of a robust analysis is to use a process that yields results that can be interpreted in ways similar to the traditional least-squares (normal-distribution-assumption-based) methods, but which are not as easily influenced by departures from the hypothesized model as the traditional methods might be.

Perhaps too long a comment, but "tossing out data" in general can lead to dangerous things.
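One concrete robust alternative to deleting points is an estimator that down-weights or ignores extreme values by construction. A minimal sketch using the Theil–Sen estimator (median of pairwise slopes), with made-up contaminated data, contrasted against ordinary least squares:

```python
import numpy as np

def theil_sen(x, y):
    """Theil–Sen estimator: the slope is the median of all pairwise
    slopes, so a minority of outliers cannot drag it arbitrarily far,
    unlike ordinary least squares."""
    n = len(x)
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(n) for j in range(i + 1, n) if x[j] != x[i]]
    slope = np.median(slopes)
    intercept = np.median(y - slope * x)
    return slope, intercept

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 40)
y = 1.5 * x + 2.0 + rng.normal(0, 0.3, size=x.size)
y[:4] += 25.0  # contaminate 10% of the data

ls_slope = np.polyfit(x, y, 1)[0]   # ordinary least squares, for contrast
ts_slope, ts_intercept = theil_sen(x, y)
```

The robust fit stays near the true slope without anyone having to decide which points to throw away.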
 
The partial leverage article was useful. Thanks for all the responses. What I was trying to get at was: is there a general means to determine the probability of a high-leverage point being an influential point?
 
Galteeth said:
The partial leverage article was useful. Thanks for all the responses. What I was trying to get at was: is there a general means to determine the probability of a high-leverage point being an influential point?

No, not to find the probability. If you are working with regression, a search on regression diagnostics will provide some numerical measures of the effect a leverage point has on your fit.

Note that all points in statistics are influential - but not all are equally influential, in good or bad ways, so looking for "influential points" may not lead to much that is useful.
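Two of the standard diagnostics mentioned, leverage (hat values) and Cook's distance, can be computed directly from the design matrix. A minimal sketch on made-up data, where one point has high leverage but lies on the trend, tying back to the earlier remark that high-leverage points need not have large residuals:

```python
import numpy as np

# Illustrative leverage and Cook's distance for simple linear regression,
# computed directly from the hat matrix H = X (X'X)^-1 X'.
rng = np.random.default_rng(3)
x = np.append(np.linspace(0, 10, 30), 30.0)   # one far-out x value
y = 0.5 * x + rng.normal(0, 0.4, size=x.size)  # it still follows the trend

X = np.column_stack([np.ones_like(x), x])      # design matrix
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                                 # leverages
p = X.shape[1]
s2 = resid @ resid / (len(x) - p)              # residual variance estimate
cooks_d = resid**2 / (p * s2) * h / (1 - h) ** 2

print("max leverage at index", np.argmax(h))   # the x = 30 point
```

Because that point agrees with the model, its residual is small and its Cook's distance stays modest despite the large leverage, which is the distinction being drawn here.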
 
Ok, thanks, that answered my question.
 
