Removing outliers with standard deviation - is it possible to end up with an empty dataset?


There is a fairly standard technique of removing outliers from a sample by using standard deviation.

Specifically, the technique is: remove from the sample any point that lies more than 1 (or 2, or 3) standard deviations (the usual unbiased sample standard deviation) away from the sample mean. Is it possible with this technique to end up removing all points from the dataset? Or is there a property of the sample standard deviation that prevents this from happening?
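For concreteness, here is a minimal Python sketch of the procedure in question (the function name remove_outliers and the toy data are just illustrative; ddof=1 gives the unbiased sample standard deviation):

```python
import numpy as np

def remove_outliers(x, k=2.0):
    """Keep only the points within k sample standard deviations of the sample mean."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma = x.std(ddof=1)            # ddof=1: the usual unbiased sample standard deviation
    return x[np.abs(x - mu) <= k * sigma]

data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 100], dtype=float)
print(remove_outliers(data, k=2))    # [1. 2. 2. 3. 3. 3. 4. 4. 5.] -- the 100 is dropped
```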


2 Answers

Best answer

Duh. It's quite simple to show that at least one data point of a sample lies within one standard deviation of the mean.

Proof -

Assume every data point is more than one standard deviation away from the mean. That is,

$|x_i-\mu| > \sigma$, for all $1 \le i \le n$

Then we have,

$\sum\limits_{i=1}^n(x_i-\mu)^2 > n\sigma^2$

But by the definition of $\sigma$ (the unbiased sample standard deviation),

$(n-1)\sigma^2 = \sum\limits_{i=1}^n(x_i-\mu)^2$,

so the assumption would force $(n-1)\sigma^2 > n\sigma^2$, which is impossible for $\sigma > 0$ (and if $\sigma = 0$, every point equals the mean, so the assumption fails immediately). Hence at least one data point lies within one standard deviation of the mean.
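A quick numerical check of this claim (a sketch in Python; the random trial setup is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(10_000):
    n = int(rng.integers(2, 50))                              # random sample size >= 2
    x = rng.normal(loc=0.0, scale=rng.uniform(0.1, 10.0), size=n)
    mu, sigma = x.mean(), x.std(ddof=1)
    # The proof above says at least one point must lie within one sample SD of the mean.
    assert np.any(np.abs(x - mu) <= sigma)
print("every trial had at least one point within one sample SD of the mean")
```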

Another answer

Interesting question.

For $k>1$, Chebyshev's inequality guarantees that a proportion of at least $1-\frac{1}{k^2}>0$ of the data points survives each pass, so after any finite number of applications of the procedure some data points are always left.

We could also ask about repeating the procedure indefinitely. Here it is enough to show that once fewer than $k^2$ points remain, no further points can be lost: if the current dataset has $N < k^2$ points, then $\frac{N-1}{N} < 1 - \frac{1}{k^2}$, so removing even one point would leave a smaller proportion than the $1-\frac{1}{k^2}$ that Chebyshev forces to remain within $k$ standard deviations. Hence no point can be removed, and the iterated procedure can never empty the dataset.
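As an illustration (a sketch only; the helper name trim and the heavy-tailed sample are my own choices), repeatedly applying the cut with $k=2$ stabilizes at a nonempty dataset:

```python
import numpy as np

def trim(x, k):
    """One pass of the cut, using the unbiased sample standard deviation."""
    mu, sigma = x.mean(), x.std(ddof=1)
    return x[np.abs(x - mu) <= k * sigma]

rng = np.random.default_rng(1)
x = rng.standard_cauchy(1000)      # heavy tails, so the first passes do remove points
k = 2.0
while True:
    trimmed = trim(x, k)
    if trimmed.size == x.size:     # nothing removed: the process has stabilized
        break
    x = trimmed
print(x.size)                      # strictly positive: the dataset never empties for k > 1
```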

For $k=1$, note that the uniform distribution on 2 points puts both points exactly 1 SD from the mean (using the distribution's own, i.e. population, standard deviation). Arguably you'd want to throw these out, and hence we are left with 0 points.

For $k<1$, the uniform distribution on 2 points results in both points being thrown out. Of course, several other distributions work too, especially ones that are very heavy in the tails. Note that Chebyshev's bound $\frac{1}{k^2}\ge 1$ becomes trivial here, and hence gives no protection in this case.
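To make the two-point example concrete (this sketch uses the population standard deviation, ddof=0, which is the SD of the two-point distribution itself; with the unbiased sample SD the two points would lie strictly inside one SD, consistent with the accepted answer):

```python
import numpy as np

x = np.array([0.0, 1.0])            # "uniform distribution on 2 points"
mu = x.mean()
sigma = x.std(ddof=0)               # population SD of this two-point distribution
print(np.abs(x - mu) / sigma)       # [1. 1.]: both points are exactly 1 SD from the mean

k = 0.9                             # any k < 1 (or a strict cut at k = 1) removes both points
print(x[np.abs(x - mu) <= k * sigma])   # [] -- nothing is left
```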