Outlier detection with robust multiple regression model


I have a set of features (e.g., location, income, budget, education) that I use to predict a continuous variable (say, amount spent per day on the internet). I am interested in detecting outliers. I want my model to be very strict and not to be swayed by outliers, and I want the outlier detection to be done on the fly. My method is to use all the data I have so far to fit a regression and then check whether any points lie more than 3 SD from the residual mean (0). I then re-train the regression using all of the data EXCEPT the points I just determined to be more than 3 SD from the residual mean. I continue this for some preset number of iterations, each time removing outliers and re-training. Each day I retrain the model iteratively using the new data and all of the old data.

I was wondering if there is a name for this technique-- since it's the first thing I thought of, someone else must have thought of it already?


There are 2 best solutions below

On BEST ANSWER

The rule you describe is well known: it is called the 3-$\sigma$ rejection rule. It is the simplest way of robustifying a regression model. You can find anything you are searching for here.

On

This rule is related to the three-sigma rule of thumb, or three-sigma rule, which has its own wiki page. For a Gaussian distribution, the interval within three standard deviations of the mean, $\mu \pm 3\sigma$, contains about $99.73\%$ of the cases. Your method is therefore often called the "three-sigma edit rule"; more precisely, in your case, an iterated (or iterative) $3$-sigma rejection scheme.
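To make the iterated scheme concrete, here is a minimal sketch of what I understand the question to describe: fit ordinary least squares, drop the points whose residual is more than 3 SD from zero, refit, and repeat. The function name `iterative_sigma_clip_ols`, the parameters `n_sigma` and `max_iter`, and the choice of NumPy's `lstsq` are my own assumptions for illustration, not something the question specifies.

```python
import numpy as np

def iterative_sigma_clip_ols(X, y, n_sigma=3.0, max_iter=10):
    """Hypothetical helper: OLS with iterated n-sigma rejection of residuals.

    Returns the coefficients (intercept first) and a boolean outlier mask.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])   # prepend an intercept column
    keep = np.ones(len(y), dtype=bool)          # points still used for fitting

    for _ in range(max_iter):
        # Fit on the currently kept points only.
        beta, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        resid = y - A @ beta
        # Residual SD estimated from kept points (residual mean is ~0 for OLS).
        sd = resid[keep].std(ddof=A.shape[1])
        flagged = keep & (np.abs(resid) > n_sigma * sd)
        if not flagged.any():
            break                                # nothing new to reject
        keep &= ~flagged                         # removal is cumulative

    # Final fit on whatever survived the rejection rounds.
    beta, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
    return beta, ~keep
```

Note that rejected points never re-enter the fit in this version, which matches the description in the question; other variants re-test every point against each new fit.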

Doing so, you are using a kind of standard score called the $Z$-score or $Z$-value: $$\frac{x_i-\bar{x}}{\sigma}\,,$$ which is compared to $3$ and $-3$. The maximum of its absolute value over the sample is the statistic of the two-sided Grubbs test. In different domains, people use a two-sigma or a five-sigma threshold, depending on the level of confidence they require. Though common, such methods only work for reasonably symmetric and unimodal distributions. Otherwise, they are likely to be very imprecise, especially if you have few points.
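As a small illustration of the "few points" problem, the sketch below computes $Z$-scores for ten values, one of which is wildly off. With the sample standard deviation ($n-1$ denominator), no $|Z|$ can exceed $(n-1)/\sqrt{n}$, so with ten points the 3-sigma rule cannot flag anything at all. The helper name `z_scores` and the toy data are my own, chosen only for the example.

```python
import math
import statistics

def z_scores(values):
    """Z-scores using the sample mean and sample SD (n-1 denominator)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Toy data (made up): nine identical values and one gross outlier.
data = [1.0] * 9 + [1000.0]

z = z_scores(data)
max_abs_z = max(abs(v) for v in z)

# Upper bound on |Z| for a sample of size n when the sample SD is used.
bound = (len(data) - 1) / math.sqrt(len(data))

print(f"largest |Z| = {max_abs_z:.3f}, bound = {bound:.3f}")
print("flagged by 3-sigma rule:", [v for v, s in zip(data, z) if abs(s) > 3])
```

The outlier inflates both the mean and the standard deviation it is measured against, which is exactly why the plain rule is not considered robust.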

Moreover, it is not considered robust, as the sample mean and the standard deviation estimates are themselves not robust. Some perform it iteratively, as you do. Some authors prefer a more robust alternative based on median statistics: $$0.6745\,\frac{x_i-\text{median}(x)}{\text{MAD}(x)}\,,$$ where $\text{MAD}$ denotes the Median Absolute Deviation. More details in Finding outliers in numerical data, A survey of outlier detection methodologies (V. J. Hodge, 2004).
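For comparison, here is a minimal sketch of the median-based score above, using only the Python standard library. The function name `modified_z_scores`, the toy data, and the cutoff of 3.5 are assumptions on my part (3.5 is a commonly cited choice for this score, not something fixed by the formula itself).

```python
import statistics

def modified_z_scores(values):
    """0.6745 * (x_i - median) / MAD, with MAD the raw median absolute deviation."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # More than half the points equal the median; the score is undefined.
        raise ValueError("MAD is zero; modified Z-score is undefined")
    return [0.6745 * (v - med) / mad for v in values]

# Toy data (made up): nine ordinary values and one gross outlier.
data = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 1.0, 0.9, 1000.0]

CUTOFF = 3.5  # assumed threshold, adjust to taste
scores = modified_z_scores(data)
flagged = [i for i, s in enumerate(scores) if abs(s) > CUTOFF]
print("flagged indices:", flagged)
```

Because the median and the MAD barely move when a single extreme value is added, the outlier stands out clearly here even with only ten points.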