Can you always decrease the variance by removing outliers?

Consider a finite set of real numbers. Let its variance be $V$. Let the highest number be $h$ and the lowest number be $l$, and let's assume $l < h$.

Let $x$ be an arbitrary number with $l < x < h$.

Now create a new set by removing one element equal to $h$ and replacing it with $x$. Call the variance of this new set $V_h$.

Also create a new set by removing one element equal to $l$ and replacing it with $x$. Call the variance of this new set $V_l$.

Is the following statement true, and if so, is there a proof: "either $V_l < V$ or $V_h < V$"?

To show that both are not always true, consider the set $\{0,0,0,1\}$ and replace one of the $0$'s with $0.9$. The variance goes up.
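The counterexample can be checked numerically. This is only a quick sketch using the population variance (dividing by $N$); the variable names are mine.

```python
from statistics import pvariance

original = [0, 0, 0, 1]          # l = 0, h = 1
x = 0.9                          # replacement value with l < x < h

low_replaced = [x, 0, 0, 1]      # one element equal to l replaced by x
high_replaced = [0, 0, 0, x]     # one element equal to h replaced by x

v = pvariance(original)
v_l = pvariance(low_replaced)
v_h = pvariance(high_replaced)

# Replacing the low value raises the variance, replacing the high value lowers it.
print(v, v_l, v_h)
```

So here only $V_h < V$ holds, which is consistent with the "either/or" form of the statement.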

There are 2 best solutions below

0
On BEST ANSWER

I think I may have found the proof.

Number the points $p_1,p_2,...,p_N$ from lowest to highest.

Replace one of the extreme points, $p_E$ where $E \in \{1,N\}$, by some $x$ with $p_1<x<p_N$. Call the new set $\{q_1,\ldots,q_N\}$.

Now

$$ Var(q) = \frac{1}{N} \sum (q_i - \bar{q})^2 \le \frac{1}{N} \sum (q_i - \bar{p})^2 $$

since the variance is minimized around the mean.
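To spell that step out (a standard identity, stated here for an arbitrary real number $c$, which is my notation): expanding the square around $\bar{q}$ gives

$$ \frac{1}{N} \sum (q_i - c)^2 = Var(q) + (\bar{q} - c)^2 \ge Var(q), $$

because the cross term $\frac{2}{N}(\bar{q} - c)\sum (q_i - \bar{q})$ vanishes. Taking $c = \bar{p}$ gives the inequality above.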

So, we want to prove that

$$ \frac{1}{N} \sum (q_i - \bar{p})^2 < Var(p) $$

i.e. that

$$ \sum (q_i - \bar{p})^2 < \sum (p_i - \bar{p})^2. $$

This reduces to

$$ (x-\bar{p})^2 < (p_E - \bar{p})^2 $$

since the sets only differ in one element.

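Now choose $E$ so that $x-\bar{p}$ has the same sign as $p_E - \bar{p}$ (if $x = \bar{p}$, either choice works). Since $p_1 < \bar{p} < p_N$ and $x$ lies strictly between $p_1$ and $p_N$, $x$ is strictly closer to $\bar{p}$ than the extreme point $p_E$ is, so the above inequality holds.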
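As a sanity check on the argument (not a substitute for it), here is a small randomized test of the claim. It is a sketch under my own assumptions: the function names are made up for illustration, the points are random integers, and exact `Fraction` arithmetic is used so the strict inequalities are not blurred by rounding.

```python
import random
from fractions import Fraction
from statistics import pvariance

def replace_extreme_variances(points, x):
    """Return (V, V_l, V_h): the original population variance, and the
    variance after replacing one minimum / one maximum by x."""
    pts = sorted(points)
    v = pvariance(pts)
    v_l = pvariance([x] + pts[1:])    # lowest element replaced by x
    v_h = pvariance(pts[:-1] + [x])   # highest element replaced by x
    return v, v_l, v_h

def check_claim(trials=2000, seed=0):
    """Return True if 'V_l < V or V_h < V' held in every random trial."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(2, 8)
        pts = [Fraction(rng.randint(-10, 10)) for _ in range(n)]
        lo, hi = min(pts), max(pts)
        if lo == hi:
            continue  # the question assumes l < h
        # x strictly between lo and hi
        x = lo + (hi - lo) * Fraction(rng.randint(1, 99), 100)
        v, v_l, v_h = replace_extreme_variances(pts, x)
        if not (v_l < v or v_h < v):
            return False
    return True

print(check_claim())
```

The check only samples small integer-valued sets, so it illustrates the statement rather than proving it.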

1
On

Call your data set $\{x_k\}_{k=1}^N$. Let's call the mean $\mu=\frac{1}{N}\sum x_k$, and write the variance $V=\frac{1}{N}\sum (x_k-\mu)^2$. Since we assume $l=\min x_k<h=\max x_k$, take $x_{\min}=l,x_\max=h$ and proceed by cases:

First suppose $x=x_\min+\epsilon\leq\mu$ with $\epsilon>0$. Then $$x_\min-\mu<x_\min+\epsilon-\mu\leq 0\Longrightarrow(x_\min+\epsilon-\mu)^2<(x_\min-\mu)^2,$$ so replacing $x_\min$ by $x$ strictly lowers the sum of squared deviations about $\mu$. Since the mean squared deviation about any fixed point is at least the variance, it follows that $V_l<V$.

Now consider the remaining case $h>x=x_\min+\epsilon\geq\mu$. We may rewrite $x=x_\max-\tilde{\epsilon}$, where now $\tilde{\epsilon}=h-l-\epsilon>0$. Then $0\leq x_\max-\tilde{\epsilon}-\mu<x_\max-\mu$, so $(x_\max-\tilde{\epsilon}-\mu)^2<(x_\max-\mu)^2$, and the same line of reasoning as in the first case yields $V_h<V$.