Way of distinguishing the outliers

sfspitfire23 · Jan 26, 2012

Hi guys,

Am hoping to tickle your guys' brains. I have a bunch of sets of data. Some are large (50+ values), some are quite small (fewer than 10). I would like a way to separate the elements that cluster closely around the mean of their respective set from the outliers of that set. The idea is to get a sense of which values are "typical" and which are "not typical" in each set. Put another way, which values are like the others in the set and which are not.

I've tried assuming normality and taking all the values that fall within 10% of the mean as the "typical" values. Problem is that the distributions of my small data sets are far from normal.
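For readers following along, here is a minimal sketch of the rule described above, assuming "within 10% of the mean" means a band of plus or minus 10% of the mean's magnitude. The function name `split_by_mean_band`, the `tolerance` parameter, and the sample values are hypothetical, chosen only for illustration. It shows one way the rule can break down on a small set: a single extreme value drags the mean so far that no value lands inside the band.

```python
from statistics import mean

def split_by_mean_band(values, tolerance=0.10):
    """Split values into those within `tolerance` (a fraction) of the mean and the rest.

    Hypothetical helper: one reading of the "within 10% of the mean" rule.
    """
    m = mean(values)
    band = abs(m) * tolerance  # half-width of the "typical" band
    typical = [v for v in values if abs(v - m) <= band]
    atypical = [v for v in values if abs(v - m) > band]
    return typical, atypical

# Made-up small data set: four values near 10 and one extreme value.
small_set = [9.8, 10.1, 10.3, 9.9, 25.0]
typical, atypical = split_by_mean_band(small_set)

print("mean:", mean(small_set))
print("typical:", typical)
print("atypical:", atypical)
```

With these made-up numbers the mean is pulled up to about 13, so even the four values that agree closely with each other fall outside the band and nothing is labelled typical.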

Anyone have any suggestions? Anyone know a method that might work well?

Thanks

Stephen Tashi · Jan 26, 2012

My suggestion is that you shouldn't get lobotomized by statistical procedures. If you have some expert knowledge about the data then bring it to bear, don't just thumb through a collection of statistical procedures.

Statistics must be based on applying probability. You can't apply it unless you know or assume some specific process for how the data is generated. For example, what are the sources of the "non typical" data points? Are they clerical errors where digits are transposed? Are they something like genetic mutations, or contamination of blood samples by improper handling?

Once you have a procedure to remove outliers, what are you going to do with the data that remains? What quantities will be estimated from the data? What decisions will be made based on the analysis?

chiro · Jan 27, 2012

Hey sfspitfire23 and welcome to the forums.

To build on the good advice of Stephen, it's important to put the data into context.

For example, if you are measuring heights and you get a data value of 500 cm, then in the context of your data that value will most likely be removed. It may actually be that someone really is 500 cm tall, in which case the value is not erroneous and you can't just throw it out. If, however, you have decided to focus purely on the common case and that data point screws up everything else, then yes, you would remove it.

You have to do this kind of thing when you are evaluating the data: you can't just chuck data out because it's an outlier. You have to analyze that in the context of your experiment and get an understanding of the kinds of ranges your data will take on, and of those ranges what makes sense in the context of what you are trying to do.
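As a rough illustration of checking values against a range that makes sense for the experiment, here is a small sketch that flags values for review instead of deleting them. The function name `flag_implausible`, the bounds of 50 cm and 275 cm, and the sample heights are all assumptions made up for this example; in practice the bounds would come from knowledge of the process being measured.

```python
def flag_implausible(values, low, high):
    """Pair each value with True if it lies in [low, high], else False.

    Nothing is discarded; the caller decides what to do with flagged values.
    """
    return [(v, low <= v <= high) for v in values]

# Made-up heights in cm; the bounds are assumed, not taken from any standard.
heights_cm = [162, 175, 181, 500, 168]
flags = flag_implausible(heights_cm, low=50, high=275)

# Values that fall outside the assumed range and deserve a closer look.
needs_review = [v for v, ok in flags if not ok]
print(needs_review)
```

Keeping the flagged value alongside the rest leaves room to decide later whether it is a recording error or a genuine, if unusual, observation.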

Also, in terms of "typical" and "non-typical", cutting the data points that lie far from the rest may not be the best way. The best way to determine if something is "typical" is to analyze the process and see if the data obtained is a good "random sample", meaning that it represents the overall data well in the context of your experiment.

This is why you need to consult with an expert as Stephen Tashi pointed out above: you have to do this with all statistical analyses if you want accurate results and inferences.
