Linear Regression of estimated measures / outliers

SUMMARY

This discussion focuses on identifying and handling outliers in linear regression analysis, specifically in the context of house size versus sale price. The participant raises concerns about a mansion sold for $80,000 amidst a dataset of higher-priced homes, questioning the mathematical methods for labeling data as significant. Key insights include the importance of constructing a 95% confidence interval for the mean price of mansions and the necessity of investigating outliers rather than removing them arbitrarily. The conversation emphasizes robust statistical practices, including regression diagnostics and the potential use of robust fitting techniques.

PREREQUISITES
  • Understanding of linear regression analysis
  • Familiarity with confidence intervals, particularly 95% confidence intervals
  • Knowledge of outlier detection methods in statistical analysis
  • Experience with regression diagnostics tools
NEXT STEPS
  • Learn about robust regression techniques to handle outliers effectively
  • Study the construction and interpretation of confidence intervals in regression
  • Explore regression diagnostics to assess the influence of outliers on model performance
  • Investigate credible intervals and prediction intervals in statistical modeling
USEFUL FOR

Data analysts, statisticians, and anyone involved in predictive modeling or regression analysis, particularly in real estate or similar fields where outlier detection is critical.

Whenry
Hi all, I would like to understand the theory for determining outliers in the following scenario.

Let's say I am to fit a linear model to the data of house size v. sale price for a particular location.

And let's say I have a fairly good linear relationship, as house size increases, so does price.

But then I have a mansion that only sold for $80,000.

Well, if it is just one house, I could safely ignore it as an outlier. But if, in my data, I have 50 mansions that all sold for under $100,000, I may have to suspect there is something about a large house that makes it very undesirable to this particular community, and that I should choose a model that reflects this.

My question is: is there a mathematical method for deciding when to label data as significant rather than as an outlier?

I have thought about creating a 95% confidence interval for a measure; in this case the measure would be the mean price of mansions. Clearly, if I only have one mansion in my data, I don't even know how to construct a 95% CI for such a small sample size, and I would leave that one mansion out. If I had 50 mansions that sold for a low price, and a fairly tight 95% CI around this low price, then at some point I would say it is significant.
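
To make the idea concrete, here is a minimal sketch of that interval in Python. The prices are invented purely for illustration, the helper name `mean_ci` is hypothetical, and the normal quantile (about 1.96) is used as an approximation; for small samples a Student t quantile would be the usual choice. With a single sale the sample standard deviation is undefined, so the sketch returns no interval at all.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_ci(prices, level=0.95):
    """Approximate CI for the mean of `prices`; None if fewer than 2 values."""
    n = len(prices)
    if n < 2:
        return None  # sample standard deviation is undefined for n = 1
    # Normal approximation (assumption): reasonable for n around 50,
    # too narrow for very small n, where a t quantile is needed.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(prices) / math.sqrt(n)
    centre = mean(prices)
    return centre - half_width, centre + half_width

# Hypothetical mansion sale prices, invented for illustration only.
one_mansion = [80_000]
fifty_mansions = [70_000 + 500 * i for i in range(50)]  # 70,000 up to 94,500

print(mean_ci(one_mansion))  # None: no interval from a single sale
low, high = mean_ci(fifty_mansions)
print(round(low), round(high))
```

A tight interval here only says the mean price of the mansions is well estimated; it does not by itself say whether the straight-line model is the right one.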

Any help on further understanding this would be much appreciated.
 
I can't tell whether you just want someone to give you a ton of links about methods of removing outliers or whether you are trying to solve a specific problem.

To get an answer to a specific problem, you must have a considerable amount of "given" information (which, in real world problems, means you must make assumptions). You must also be able to state clearly what you are trying to accomplish.

Based on other threads, many non-statisticians who mention "confidence intervals" in their posts are really talking about "credible intervals" or "prediction intervals". So I hesitate to comment on the method you outlined till that is cleared up.
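
For reference, the distinction can be written out for the simple regression of price on size. This is a sketch under the usual textbook assumptions (independent, normally distributed errors with constant variance), which the thread itself does not state. A credible interval is the Bayesian counterpart and depends on a chosen prior, so no single formula is given for it.

```latex
% Simple linear regression y = \beta_0 + \beta_1 x + \varepsilon, fitted on n points.
% s is the residual standard error, S_{xx} = \sum_i (x_i - \bar{x})^2,
% and \hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0 is the fitted price at size x_0.
\begin{align*}
\text{95\% CI for the mean price at size } x_0:&\quad
  \hat{y}_0 \pm t_{n-2,\,0.975}\; s\,\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}} \\
\text{95\% PI for one new sale at size } x_0:&\quad
  \hat{y}_0 \pm t_{n-2,\,0.975}\; s\,\sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}}
\end{align*}
```

The extra 1 under the square root is the variance of a single new observation, which is why a prediction interval is always wider than the confidence interval for the mean at the same size.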
 
It is never appropriate to eliminate an outlier from a problem simply because it is an outlier. You can
* temporarily remove it and rerun the analysis to gauge the influence the outlier is having on the problem
* investigate it to see whether there was some error in transcription (writing $80,000 rather than $800,000, for instance)
* investigate whether the peculiar data point(s) is (are) from a population you don't intend to study

If you find an error (transcription, wrong population, problem with the recording, etc.), it is acceptable to remove the outlier and proceed. Absent that, you should leave it in. Removing it simply because you don't like it is not an acceptable statistical practice.
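
As a minimal sketch of the first option (temporarily removing the point and rerunning the analysis), the following fits an ordinary least-squares line with and without the suspect sale. The sizes and prices are invented for illustration, and `ols` is a hypothetical helper written out here rather than a library call.

```python
def ols(xs, ys):
    """Least-squares (intercept, slope) for the line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: size in square feet, price in dollars.
# The last entry plays the role of the cheap mansion.
sizes = [1000, 1500, 2000, 2500, 3000, 8000]
prices = [150_000, 225_000, 300_000, 375_000, 450_000, 80_000]

suspect = 5  # index of the point under investigation
kept = [i for i in range(len(sizes)) if i != suspect]

a_all, b_all = ols(sizes, prices)
a_wo, b_wo = ols([sizes[i] for i in kept], [prices[i] for i in kept])

print("slope with the mansion:   ", b_all)
print("slope without the mansion:", b_wo)
```

In this made-up data the single mansion is enough to turn the fitted slope negative, which shows how much influence it has. The comparison is a diagnostic only; it is not a justification for leaving the point out of the final analysis.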
Have you tried a robust fit, or even looking at the regression diagnostics to see what influence the data you mention exhibits?
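
One possible reading of both suggestions, sketched in Python on invented data: the Theil-Sen estimator (median of all pairwise slopes) stands in for "a robust fit", and the leverage of each point stands in for "regression diagnostics". Neither choice comes from the thread; other robust fits (Huber, least trimmed squares) and other diagnostics (Cook's distance, studentized residuals) would serve equally well. The function names are hypothetical.

```python
from itertools import combinations
from statistics import median

def theil_sen(xs, ys):
    """Robust (intercept, slope): median of pairwise slopes, then median offset."""
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
              if x2 != x1]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return intercept, slope

def leverage(xs):
    """Hat-matrix diagonal for simple regression: 1/n + (x - x_bar)^2 / Sxx."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]

# Hypothetical data: size in square feet, price in dollars.
# The last entry plays the role of the cheap mansion.
sizes = [1000, 1500, 2000, 2500, 3000, 8000]
prices = [150_000, 225_000, 300_000, 375_000, 450_000, 80_000]

intercept_ts, slope_ts = theil_sen(sizes, prices)
leverages = leverage(sizes)

print("Theil-Sen slope:", slope_ts)
# A common rule of thumb flags leverage above 2p/n, with p = 2 coefficients.
threshold = 2 * 2 / len(sizes)
for size, h in zip(sizes, leverages):
    print(size, round(h, 3), "high leverage" if h > threshold else "")
```

The robust slope follows the five ordinary houses and is barely moved by the mansion, while the leverage column flags the mansion as the point with the most pull on an ordinary least-squares fit. If there were 50 cheap mansions rather than one, a robust fit would no longer treat them as noise, which is the situation the original question is worried about.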
 
