Determine Outliers in a non-normal distribution

  • Thread starter: ExNihilo
  • Tags: Distribution
AI Thread Summary
To determine upper outliers in a non-normal distribution of page views per IP address, the Interquartile Range (IQR) method is suggested as a potential solution. The dataset shows a significant disparity in page views, with web crawlers generating far more hits than typical users. The discussion emphasizes using the median rather than the mean to establish a threshold for identifying outliers, particularly in a "cliff-shaped" distribution. Concerns are raised about the risk of identifying too many outliers if the distance from quartiles is not set equally. Constructing a boxplot is recommended as a visual tool to identify outliers effectively.
ExNihilo

Homework Statement



I would like to determine the upper outliers in a dataset whose distribution is NOT normal. The dataset represents the number of pages viewed for each IP address. Basically, when a web page is viewed by human users, the IP address has very few hits (1 to 3), while the IP addresses of web crawlers make a lot of page hits.

- IP 1 = 700 views
- IP 2 = 650 views
...
- IP n1 = 50 views
- IP n2 = 45 views
...
- IP n3 = 3 views
- IP n4 = 2 views
- IP n5 = 1 view

The sample represents a few thousand unique IP addresses, each with its page view count. It is quite possible that different IP addresses viewed the same number of pages. I would like to use a statistical method to determine a threshold value that separates the crawlers from normal users.


Thank you very much in advance for any advice.


Homework Equations


(none)


The Attempt at a Solution


The data distribution is not normal, so I am not sure whether the standard deviation method would apply. Searching on the net, a possible solution
could be the "Interquartile Range" (IQR) method: http://krishnadagli.blogspot.com/2008/05/learning-statistics-using-rdetecting.html

I am not sure if this would apply well to my scenario. Can you please confirm, or add anything I am missing?
 
The method you mention will work, but looking at the vast differences in the sizes of the measurements, you may find a huge number of outliers.
If you have access to a program that creates boxplots of numerical data (Minitab, R), the outliers will appear as separate symbols (asterisks in Minitab) at the upper end of the boxplot; the upper whisker extends only as far as the largest value that is not an outlier.
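
For anyone without Minitab or R at hand, here is a minimal Python sketch of the upper cutoff that a boxplot typically uses (third quartile plus 1.5 IQR). The page-view counts are made up for illustration, the helper name `upper_fence` is hypothetical, and the "inclusive" quartile convention is only one of several in use, so other programs may give slightly different quartiles.

```python
import statistics

def upper_fence(values, k=1.5):
    """Return Q3 + k * IQR, the usual upper cutoff for boxplot outliers."""
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)

# Hypothetical page-view counts per IP: mostly 1 to 3 views, plus a few crawlers.
views = [1] * 50 + [2] * 30 + [3] * 15 + [45, 50, 650, 700]

fence = upper_fence(views)
crawlers = [v for v in views if v > fence]
print(fence, crawlers)
```

With data this skewed the IQR is tiny, so the cutoff sits very close to the bulk of the data; on a real log that is how a large number of IPs can end up flagged.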

Just a comment: for data that are normally distributed there is a link between this method and the mean and standard deviation. The first and third quartiles of a normal distribution lie roughly 0.6745 standard deviations below and above the mean, so the IQR for the normal distribution is roughly 1.349 times the standard deviation.
Saying an outlier is any value more than 1.5 IQR below the first quartile or above the third quartile is the same as saying the limit is about 2.02 standard deviations beyond the quartile, or about 2.70 standard deviations from the mean. For normally distributed data, at least, the IQR method and the usual standard deviation methods are comparable.

I know your data is not normal, but the comparison in the normal case can be helpful in seeing what the motivation is.
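
The normal-case numbers can be checked with a short Python sketch using the standard library's `NormalDist`. This only illustrates the comparison for a standard normal distribution; it says nothing about the page-view data itself, and the variable names are mine.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

q1 = std_normal.inv_cdf(0.25)    # about -0.6745 standard deviations
q3 = std_normal.inv_cdf(0.75)    # about +0.6745 standard deviations
iqr = q3 - q1                    # about 1.349 standard deviations
whisker = 1.5 * iqr              # about 2.02 standard deviations beyond the quartile
upper_fence = q3 + whisker       # about 2.70 standard deviations above the mean

# Fraction of a normal population lying above the upper cutoff.
tail = 1.0 - std_normal.cdf(upper_fence)
print(q3, iqr, whisker, upper_fence, tail)
```

So for normal data the 1.5 IQR rule flags only a fraction of a percent of the values in each tail, which is why it behaves much like a cutoff between 2 and 3 standard deviations.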
 
Hi,

Thank you for your advice. The data sample I am dealing with is closer to a "cliff shape"
than a bell curve. I haven't started any serious work yet, so I cannot confirm how accurate the IQR method would be.

However, I plan to use the median (not the mean) as a control value, just as a way to give the data frequency some influence. Something like: if 1.5 IQR is well above the median, then it is a reliable threshold. Do you think the median could be used to arbitrate usefully in making a decision? If so, can you suggest some directions in which I could develop this further?

Thanks in advance.
 
The one problem I have with that is this: the method I outlined, where you determine the location of outliers by using a fixed distance above the third quartile and below the first quartile, "protects" the central 50% of the data from being identified as outliers (as does your idea), but it also works on the assumption that outliers begin a specific distance from those quartiles: the cutoffs for large and small outliers are equally distant from the upper and lower quartiles. Your procedure would not necessarily set the distance equally far from each quartile, only equally distant from the center of the data. In short, my fear is that you would identify too many values as outliers.
Have you constructed a boxplot? (It would be a good choice, since most programs that provide them show outliers on the plot.)
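
To make the concern concrete, here is a rough Python sketch comparing the two cutoffs on made-up page-view counts. Note the assumptions: the data are hypothetical, and "median + 1.5 IQR" is only one possible reading of the median-based idea, not necessarily what was intended.

```python
import statistics

# Hypothetical page-view counts per IP (not the real log data).
views = [1] * 50 + [2] * 30 + [3] * 15 + [45, 50, 650, 700]

q1, median, q3 = statistics.quantiles(views, n=4, method="inclusive")
iqr = q3 - q1

quartile_cutoff = q3 + 1.5 * iqr     # distance measured from the third quartile
median_cutoff = median + 1.5 * iqr   # assumed reading: distance measured from the center

flagged_by_quartile = [v for v in views if v > quartile_cutoff]
flagged_by_median = [v for v in views if v > median_cutoff]
print(len(flagged_by_quartile), len(flagged_by_median))
```

On this toy sample the median-based cutoff falls below 3, so every IP with 3 views is flagged along with the crawlers, while the quartile-based cutoff flags only the four large values.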
 