Way of distinguishing the outliers

  1. Jan 26, 2012 #1
    Hi guys,

    Am hoping to tickle your guys' brains. I have a bunch of sets of data. Some are large (50+), some are quite small (fewer than 10). I would like a way to separate the elements that cluster closely around the mean of their respective set from the outliers of that set. The idea is to get a sense of which values are "typical" and which are "not typical" in each set. Put another way, which values are like the others in the set and which are not.

    I've tried assuming normality and taking all the values that fall within 10% of the mean as the "typical" values. Problem is that the distributions of my small data sets are far from normal.
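
    For reference, the "within 10% of the mean" rule described above might be sketched roughly as follows. This is only an illustration: the function name `split_by_mean_band`, the `tolerance` parameter, and the sample numbers are made up and are not from the original post.

```python
from statistics import mean


def split_by_mean_band(values, tolerance=0.10):
    """Split values into (typical, atypical) using a band around the mean.

    A value counts as "typical" if it lies within `tolerance` (a fraction
    of the mean's magnitude) of the sample mean. Hypothetical helper.
    """
    m = mean(values)
    band = tolerance * abs(m)
    typical = [v for v in values if abs(v - m) <= band]
    atypical = [v for v in values if abs(v - m) > band]
    return typical, atypical


# Made-up small data set with one extreme value.
small_set = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0]
typical, atypical = split_by_mean_band(small_set)

print("mean:", mean(small_set))
print("typical:", typical)
print("atypical:", atypical)
```

    With these made-up numbers the single extreme value pulls the mean up to about 12.5, so the 10% band covers none of the values and every point is labelled "atypical". That is one way the rule can misbehave on a small, non-normal set.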

    Anyone have any suggestions? Anyone know a method that might work well?

  3. Jan 26, 2012 #2

    Stephen Tashi

    Science Advisor

    My suggestion is that you shouldn't get lobotomized by statistical procedures. If you have some expert knowledge about the data, then bring it to bear; don't just thumb through a collection of statistical procedures.

    Statistics must be based on applying probability. You can't apply it unless you know or assume some specific process for how the data is generated. For example, what are the sources of the "non typical" data points? Are they clerical errors where digits are transposed? Are they something like genetic mutations? Or contamination of blood samples by improper handling?

    Once you have a procedure to remove outliers, what are you going to do with the data that remains? What quantities will be estimated from the data? What decisions will be made based on the analysis?
  4. Jan 27, 2012 #3


    Science Advisor

    Hey sfspitfire23 and welcome to the forums.

    To build on Stephen's good advice, it's important to put the data into context.

    For example, if you are measuring heights and you get a data value of 500 cm, then in the context of your data that value will most likely be removed. It may be that someone really is 500 cm tall, in which case you can't just throw the value out, because it is not erroneous. If, however, you have decided to focus purely on the common case and that data point distorts everything else, then yes, you would remove it.

    You have to do this kind of thing when you are evaluating the data: you can't just chuck data out because it's an outlier. You have to analyze it in the context of your experiment, get an understanding of the kinds of ranges your data will take on, and work out which of those ranges make sense for what you are trying to do.

    Also, in terms of "typical" and "non-typical", cutting the data points that deviate most strongly from the rest may not be the best way. The best way to determine if something is "typical" is to analyze the process and see if the data obtained is a good "random sample", meaning that it represents the overall data well in the context of your experiment.

    This is why you need to consult with an expert as Stephen Tashi pointed out above: you have to do this with all statistical analyses if you want accurate results and inferences.