IQR Outlier detection


For outlier detection I used the IQR rule. My question is: is this kind of outlier detection only useful when applied to normally distributed data?

My distribution looks like this

[Figure: plot of the distribution]



The most commonly used IQR outlier-detection rule designates as an 'outlier' any observation above Q3 + 1.5(IQR) or below Q1 - 1.5(IQR), where Q1 is the lower quartile, Q3 is the upper quartile, and IQR = Q3 - Q1. The use of this rule is not restricted to normal data.
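As a minimal sketch of the rule just stated, the fences can be computed directly with `quantile()`. The helper name `iqr_outliers` and the sample vector `x` are made up for illustration. Note that `boxplot.stats()` uses hinges rather than `quantile()`'s default quartiles, so the two approaches can disagree slightly on small samples.

```r
# Hypothetical helper: return the observations outside the 1.5 * IQR fences.
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)  # Q1 and Q3 (default type 7)
  iqr <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr                       # lower fence
  upper <- q[2] + 1.5 * iqr                       # upper fence
  x[x < lower | x > upper]
}

# Illustrative data: nine moderate values and one extreme value.
x <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 50)
iqr_outliers(x)
## 50
```

Here Q1 = 4.25 and Q3 = 8.75, so the fences are -2.5 and 15.5, and only the value 50 is flagged.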

However, it is important to realize that some distributions are inherently more likely to show outliers than others. A sample of size $n = 100$ from a normal distribution shows about one outlier on average, as illustrated by the following simulation in R, which counts outliers in 100,000 such samples.

nr.out = replicate( 10^5, length(boxplot.stats(rnorm(100))$out) )
mean(nr.out)
## 0.92325

An analogous simulation for samples of size 100 from an exponential distribution shows an average of about 4.86 outliers per sample. This is because exponential distributions have a long right tail. It would be unusual for an exponential sample of size 100 not to show at least one outlier: fewer than 1% of the simulated samples had none.

nr.out = replicate(10^5, length(boxplot.stats(rexp(100))$out ))
mean(nr.out)
## 4.8561
mean(nr.out == 0)
## 0.00952
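As a rough cross-check on the two simulations above, one can apply the same rule to the population quartiles instead of sample quartiles and ask how much probability lies beyond the fences. This is only a sketch: the variable names are illustrative, and using population fences ignores the sampling variability of the quartiles, so the figures need not match the simulated averages exactly.

```r
# Standard normal: fences from the population quartiles, symmetric about 0.
upper.norm <- qnorm(0.75) + 1.5 * (qnorm(0.75) - qnorm(0.25))
p.norm <- 2 * pnorm(upper.norm, lower.tail = FALSE)  # both tails

# Exponential with rate 1: the lower fence is negative, so only the
# upper tail can contribute.
upper.exp <- qexp(0.75) + 1.5 * (qexp(0.75) - qexp(0.25))
p.exp <- pexp(upper.exp, lower.tail = FALSE)

# Expected number of flagged points among 100 observations.
exp.normal <- 100 * p.norm
exp.expo <- 100 * p.exp
c(normal = exp.normal, exponential = exp.expo)
```

This gives roughly 0.70 for the normal case and 4.81 for the exponential case, which is the same order as the simulated 0.92 and 4.86. The simulated averages are somewhat higher, plausibly because the fences are estimated from each sample.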

By contrast, uniform distributions 'have no tails' and samples of size 100 from a uniform distribution almost never show any outliers.
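The uniform claim can be checked with the same kind of simulation used above. The following is a sketch that assumes the standard uniform distribution on (0, 1) and uses fewer replications to keep the run short; the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
# Count boxplot outliers in 10,000 uniform samples of size 100.
nr.out <- replicate(10^4, length(boxplot.stats(runif(100))$out))
mean(nr.out)        # average number of outliers per sample
mean(nr.out == 0)   # proportion of samples with no outliers
```

For the uniform distribution on (0, 1) the population fences are 0.25 - 1.5(0.5) = -0.5 and 0.75 + 1.5(0.5) = 1.5, both outside the range of possible values, so a sample produces an outlier only if its quartiles land far from the population quartiles.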

Of course, in real data some outliers may occur because of an unusual departure from population values, for example a data entry error, equipment failure, a solar flare, and so on. In certain kinds of experiments, it is only the outliers that are of interest. It was outlier events in the collider at CERN that confirmed the existence of Higgs bosons. Many earthquakes are detected worldwide every day; it is only the extreme outliers that cause damage and so are of interest to the general public.

So, you can use the IQR method of identifying 'outliers' in trying to understand data from almost any population as long as you understand how to interpret what it means to be an outlier in your particular context.