# Linear Regression of estimated measures / outliers

## Main Question or Discussion Point

Hi all, I would like to understand the theory for determining outliers in the following scenario.

Let's say I am fitting a linear model to data on house size vs. sale price for a particular location.

And let's say I have a fairly good linear relationship: as house size increases, so does price.

But then I have a mansion that only sold for $80,000. Well, if it is just one house, I could safely ignore it as an outlier. But if, in my data, I have 50 mansions that all sold for under $100,000, I may have to suspect there is something about a large house that makes it very undesirable to the particular community, and that I should choose a model that reflects this.

My question is: is there a mathematical method for deciding when a group of unusual data points is significant and when it can be treated as outliers?

I have thought about creating the 95% confidence interval for a measure; in this case the measure would be the mean price of mansions. Clearly, if I only have one mansion in my data, I don't even know how to construct a 95% CI for such a small sample size, but if I had one, I would leave that mansion out. If I had 50 mansions that sold for a low price, and I had a fairly tight 95% CI around this low price, then at some point I would say it is significant.
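To make the idea concrete, here is a minimal sketch of that interval in Python. The mansion prices are made up for illustration, and `mean_ci` is a hypothetical helper name. It uses a normal approximation, which is only reasonable for a sample as large as 50; for a small sample a Student t quantile would be needed instead. With a single mansion the sample standard deviation is undefined, so no interval can be built at all.

```python
import math
import statistics

def mean_ci(prices, level=0.95):
    """Approximate CI for the mean price (normal approximation, large n)."""
    n = len(prices)
    if n < 2:
        # One observation gives no estimate of spread, hence no interval.
        raise ValueError("need at least two observations to estimate spread")
    mean = statistics.fmean(prices)
    sem = statistics.stdev(prices) / math.sqrt(n)  # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * sem, mean + z * sem

# Hypothetical data: 50 mansions with prices between $80,000 and $98,000.
mansions = [80_000 + 2_000 * (i % 10) for i in range(50)]
low, high = mean_ci(mansions)
print(f"95% CI for mean mansion price: ({low:,.0f}, {high:,.0f})")
```

Note that this interval describes the average mansion price, not the range in which an individual sale is expected to fall.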

Any help on further understanding this would be much appreciated.

## Answers and Replies

Stephen Tashi
Science Advisor
I can't tell whether you just want someone to give you a ton of links about methods of removing outliers or whether you are trying to solve a specific problem.

To get an answer to a specific problem, you must have a considerable amount of "given" information (which, in real-world problems, means you must make assumptions). You must also be able to state clearly what you are trying to accomplish.

Based on other threads, many non-statisticians who mention "confidence intervals" in their posts are really talking about "credible intervals" or "prediction intervals". So I hesitate to comment on the method you outlined till that is cleared up.

statdad
Homework Helper
It is never appropriate to eliminate an outlier from a problem simply because it is an outlier. You can:
* temporarily remove it and rerun the analysis to gauge the influence the outlier is having on the problem
* investigate it to see whether there was some error in transcription (writing $80,000 rather than $800,000, for instance)
* investigate whether the peculiar data point(s) is (are) from a population you don't intend to study
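The first option can be sketched in a few lines of Python. The sizes and prices below are invented for illustration (the last point plays the role of the cheap mansion), and `ols` is a hypothetical helper that fits a simple least-squares line.

```python
def ols(xs, ys):
    """Simple least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical data: size in square feet, price in dollars.
# The last point is the suspicious mansion.
sizes = [1000, 1500, 2000, 2500, 3000, 8000]
prices = [100_000, 150_000, 200_000, 250_000, 300_000, 80_000]

slope_all, intercept_all = ols(sizes, prices)
slope_wo, intercept_wo = ols(sizes[:-1], prices[:-1])  # mansion set aside

print(f"with mansion:    slope = {slope_all:.2f} $/sq ft")
print(f"without mansion: slope = {slope_wo:.2f} $/sq ft")
```

A large change between the two fits shows how much that one point drives the result. It is a reason to investigate the point, not a justification for deleting it.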

If you find an error (transcription, wrong population, problem with the recording, etc.), it is acceptable to remove the outlier and proceed. Absent that, you should leave it in. Removing it simply because you don't like it is not an acceptable statistical practice.

Have you tried a robust fit, or looked at the regression diagnostics to see how much influence the data you mention has on the fit?
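As a rough illustration of both suggestions, the sketch below uses made-up data and hypothetical helper names. It uses the Theil-Sen estimator (median of pairwise slopes) as one example of a robust fit, and Cook's distance as one example of an influence diagnostic; other choices are equally valid.

```python
import statistics
from itertools import combinations

# Hypothetical data: the last point is the cheap mansion.
sizes = [1000, 1500, 2000, 2500, 3000, 8000]
prices = [100_000, 150_000, 200_000, 250_000, 300_000, 80_000]

def theil_sen(xs, ys):
    """Robust line: median of pairwise slopes, median of implied intercepts."""
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
              if x2 != x1]
    slope = statistics.median(slopes)
    intercept = statistics.median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept

def cooks_distances(xs, ys):
    """Cook's distance of each point in a simple least-squares line fit."""
    n, p = len(xs), 2  # p = number of fitted parameters
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    s2 = sum(e * e for e in resid) / (n - p)           # residual variance
    lev = [1 / n + (x - mx) ** 2 / sxx for x in xs]    # leverage h_i
    return [e * e / (p * s2) * h / (1 - h) ** 2 for e, h in zip(resid, lev)]

robust_slope, robust_intercept = theil_sen(sizes, prices)
cooks = cooks_distances(sizes, prices)
most_influential = max(range(len(cooks)), key=cooks.__getitem__)

print(f"Theil-Sen slope: {robust_slope:.2f} $/sq ft")
print(f"most influential point: index {most_influential}, "
      f"Cook's D = {cooks[most_influential]:.2f}")
```

With a single odd point the robust slope stays close to the trend of the other houses, while the diagnostic flags that point as highly influential. If many mansions sold cheaply, a robust fit would no longer ignore them, which is itself informative about whether a straight line is the right model.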