How to find the outlier (40, 10) in this case using the IQR rule?


Suppose I need to remove the outlier, (40, 10) in this case (see the plot below), using the IQR rule. How do I do that?

Compared to the neighbouring points, (40, 10) is clearly an outlier. However,

Q1 = 11.25,
Q3 = 35.75,
1.5 * IQR = 1.5 * (Q3 - Q1) = 36.75,

so only points with a y-value below 11.25 - 36.75 = -25.5 or above 35.75 + 36.75 = 72.5 are flagged as outliers, and no point qualifies. How do I find and remove (40, 10) if I must use the IQR rule?
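The fences quoted above can be reproduced directly in pandas (a sketch built on the question's own data-generating line), confirming that the raw y-values flag nothing:

```python
import pandas as pd

test = pd.DataFrame({'x': range(50), 'y': [i if i != 40 else 10 for i in range(50)]})

q1, q3 = test['y'].quantile([0.25, 0.75])   # 11.25 and 35.75
iqr = q3 - q1                               # 24.5, so 1.5*IQR = 36.75
outliers = test[(test['y'] < q1 - 1.5 * iqr) | (test['y'] > q3 + 1.5 * iqr)]
print(len(outliers))   # 0 -- no y-value lies outside [-25.5, 72.5]
```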

Here's my code:

import pandas as pd
import matplotlib.pyplot as plt

test = pd.DataFrame({'x': range(50), 'y': [i if i != 40 else 10 for i in range(50)]})

plt.figure()
plt.scatter(test['x'], test['y'], marker='x')
plt.show()

Here's the plot generated by the code above. Please view it; the question makes no sense without the plot.

[Figure: scatter plot of the 50 points, lying on the line y = x except for the single point (40, 10) well below it.]

Best answer:

@Henry is correct. The point you show is not an outlier among the $x$s nor among the $Y$s. It is an outlier among the residuals from the regression line of $Y$ on $x.$

I do not have access to your data, so here is a broadly similar simulation, with data sampled in R, along with a regression analysis and a boxplot of the residuals.

Generate data for regression according to the model $Y_i = 3x_i + 10 + e_i,$ where $e_i$ are IID $\mathsf{Norm}(0, \sigma), \sigma = 5.$ An outlier from the regression line is introduced as point $(80,50).$

set.seed(2020)  # for reproducibility
x = 1:100
y = 3*x + 10 + rnorm(100, 0, 5)
y[x == 80] = 50  # plant the outlier at (80, 50)

The left panel of the figure below shows the $n=100$ points; the regression line is added through the data afterwards.

par(mfrow=c(1,2))    # two panels per figure
plot(x, y, pch=20)   # plot the data
reg.out = lm(y~x)    # store regression output

Important information about the regression of $Y$ on $x:$ Notice the extreme negative residual at about $-196.$

In the regression model $Y_i = \alpha x_i + \beta + e_i,$ the estimate of the slope $\alpha$ is $\hat\alpha = 2.9251$ (close to the true $3$), and the estimate of the intercept $\beta$ is $\hat\beta = 12.3146$ (close to the true $10$). The residual standard error $\hat\sigma = 20.81$, however, is far above the true $\sigma = 5$: the planted outlier interferes only slightly with the slope and intercept estimates, but inflates the estimate of $\sigma$ substantially. The t tests show that neither slope nor intercept is $0.$

summary(reg.out)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-196.323   -1.107    1.812    4.915   18.487 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  12.3146     4.1937   2.936  0.00414 ** 
x             2.9251     0.0721  40.572  < 2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 20.81 on 98 degrees of freedom
Multiple R-squared:  0.9438,    Adjusted R-squared:  0.9432 
F-statistic:  1646 on 1 and 98 DF,  p-value: < 2.2e-16

In the left panel below, the (blue) regression line $\hat Y = \hat\alpha x_i + \hat\beta$ is plotted through the data. Residuals $r_i = Y_i - (\hat\alpha x_i + \hat \beta)$ show vertical distances between each of the points and the regression line. Values of the $n=100$ residuals are stored in the vector r.

abline(reg.out, col="blue")
r = reg.out$resid

The right panel below shows a boxplot of the 100 residuals. Our artificially introduced outlier is the point at the bottom of the boxplot. The function boxplot.stats reports its value.

boxplot(r, main="Residuals")
min(boxplot.stats(r)$out)
[1] -196.3228
par(mfrow=c(1,1))  # return to single-panel plotting
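For readers working in Python rather than R, the same simulate-fit-flag routine can be sketched with NumPy (note that `default_rng(2020)` does not reproduce R's `set.seed(2020)` stream, so the exact residual values differ):

```python
import numpy as np

rng = np.random.default_rng(2020)  # NOT the same stream as R's set.seed(2020)
x = np.arange(1, 101)
y = 3 * x + 10 + rng.normal(0, 5, size=100)
y[x == 80] = 50                    # plant an outlier relative to the line

# Least-squares fit and residuals
a, b = np.polyfit(x, y, 1)
r = y - (a * x + b)

# boxplot.stats-style flagging: points beyond 1.5*IQR from the quartiles
q1, q3 = np.percentile(r, [25, 75])
iqr = q3 - q1
flagged = x[(r < q1 - 1.5 * iqr) | (r > q3 + 1.5 * iqr)]
print(flagged)   # includes 80, whose residual is near -200
```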

[Figure: left panel, scatter of the simulated data with the blue fitted regression line; right panel, boxplot of the residuals with the planted outlier at the bottom.]
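Finally, translating the answer's residual idea back to the question's pandas setup (a sketch; `np.polyfit` is just one of several ways to fit the line):

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({'x': range(50), 'y': [i if i != 40 else 10 for i in range(50)]})

# Fit y on x by least squares and take the residuals.
a, b = np.polyfit(test['x'], test['y'], 1)
resid = test['y'] - (a * test['x'] + b)

# Apply the 1.5*IQR rule to the residuals, not to the raw y-values.
q1, q3 = resid.quantile([0.25, 0.75])
iqr = q3 - q1
is_out = (resid < q1 - 1.5 * iqr) | (resid > q3 + 1.5 * iqr)

print(test[is_out])        # flags exactly the row x=40, y=10
cleaned = test[~is_out]    # the data with the outlier removed
```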