Intuition behind Chebyshev's inequality


Is there any intuition behind Chebyshev's inequality, or is it just pure mathematics? What strikes me is that it applies to any random variable, whatever its distribution:

$$ \Pr(|X-\mu|\geq k\sigma) \leq \frac{1}{k^2}. $$


There are 5 answers below.

Accepted answer:

The intuition is that if $g(x) \geq h(x) ~\forall x \in \mathbb R$, then $E[g(X)] \geq E[h(X)]$ for any random variable $X$ (for which these expectations exist). This is what one would intuitively expect: since $g(X)$ is always at least as large as $h(X)$, the average value of $g(X)$ must be at least as large as the average value of $h(X)$.

Now apply this intuition to the functions $$g(x) = (x-\mu)^2 ~ \text{and}~ h(x)= \begin{cases}a^2,& |x - \mu| \geq a,\\0, & |x-\mu|< a,\end{cases}$$ where $a > 0$ and where $X$ is a random variable with finite mean $\mu$ and finite variance $\sigma^2$. This gives $$E[(X-\mu)^2] = \sigma^2 \geq E[h(X)] = a^2P\{|X-\mu|\geq a\}.$$ Finally, set $a = k\sigma$ to get the Chebyshev inequality.
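As a quick sanity check of the argument above, the following sketch estimates both expectations by Monte Carlo; the choice of an Exp(1) distribution and the sample size are arbitrary, not part of the argument:

```python
# Monte Carlo check: E[g(X)] >= E[h(X)] with g(x) = (x - mu)^2 and
# h(x) = a^2 * 1{|x - mu| >= a}, plus the resulting Chebyshev bound.
# The Exp(1) distribution (mean 1, variance 1) is an arbitrary choice.
import random

random.seed(0)
samples = [random.expovariate(1.0) for _ in range(200_000)]
mu, sigma = 1.0, 1.0      # mean and std dev of Exp(1)
k = 2.0
a = k * sigma

e_g = sum((x - mu) ** 2 for x in samples) / len(samples)
tail_prob = sum(abs(x - mu) >= a for x in samples) / len(samples)
e_h = a ** 2 * tail_prob

print(e_g >= e_h)                 # E[g(X)] >= E[h(X)]
print(tail_prob <= 1 / k ** 2)    # the Chebyshev bound itself
```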


Alternatively, consider the variance $\sigma^2$ as representing the moment of inertia of the probability mass about the center of mass (a.k.a. the mean $\mu$). The total probability mass $M$ in the region $(-\infty, \mu-k\sigma] \cup [\mu+k\sigma, \infty)$, far away from the mean $\mu$, contributes at least $M\cdot (k\sigma)^2$ to the sum or integral for $\sigma^2 = E[(X-\mu)^2]$, and so, since everything else in that sum or integral is nonnegative, it must be that $$\sigma^2 \geq M\cdot (k\sigma)^2 \implies M = P\{|X-\mu| \geq k\sigma\} \leq \frac{1}{k^2}.$$

Note that for a given value of $k$, equality holds in the Chebyshev inequality when there are equal point masses of $\frac{1}{2k^2}$ at $\mu \pm k\sigma$ and a point mass of $1 - \frac{1}{k^2}$ at $\mu$. The central mass contributes nothing to the variance/moment-of-inertia-about-center-of-mass calculation, while the far-away masses each contribute $\left(\frac{1}{2k^2}\right)(k\sigma)^2 = \frac{\sigma^2}{2}$, adding up to the variance $\sigma^2$.
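This equality case can be checked numerically; the particular values of $\mu$, $\sigma$, and $k$ below are arbitrary:

```python
# Check of the equality case: point masses 1/(2k^2) at mu +/- k*sigma
# and 1 - 1/k^2 at mu give mean mu, variance sigma^2, and tail
# probability exactly 1/k^2. The values mu, sigma, k are arbitrary.
mu, sigma, k = 0.0, 1.0, 3.0

points = [mu - k * sigma, mu, mu + k * sigma]
masses = [1 / (2 * k ** 2), 1 - 1 / k ** 2, 1 / (2 * k ** 2)]

mean = sum(p * m for p, m in zip(points, masses))
var = sum((p - mean) ** 2 * m for p, m in zip(points, masses))
tail = sum(m for p, m in zip(points, masses) if abs(p - mu) >= k * sigma)

print(mean, var, tail)  # mean mu, variance sigma^2, tail 1/k^2
```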

Answer:

Square-integrable random variables are not arbitrary random variables. They are in fact quite regular!

Once you know that your variable has a variance, it's natural that its distance from the mean can be controlled in probability by this variance. Chebyshev's inequality is probably the simplest way to achieve that.

Answer:

To me it means:

The further the random variable is from the mean, the less likely it becomes. Here $k$ gives the number of standard deviations (any $k > 0$ works), and the tail probability is automatically bounded by $1/k^2$.

My intuition for why this is a meaningful statement for all random variables is the following: the measure of the whole space is bounded, namely equal to $1$. You cannot spread mass bounded away from zero over all of the reals (the total measure would be infinite), so the distribution must vanish in the tails.

Answer:

It's useful to view Chebyshev's inequality as an application of Markov's inequality, which for a nonnegative random variable $X$ and $\alpha > 0$ states that

$$ \begin{align} P(X \geq \alpha) \leq \frac{\text{E}(X)}{\alpha} . \end{align} $$

(Notice how we arrive at Chebyshev's inequality by applying Markov's inequality to the nonnegative variable $(X - \mu)^2$ and the event $\{(X - \mu)^2 \geq k^2 \sigma^2 \}$, which is equivalent to $\{|X - \mu| \geq k \sigma \}$ and therefore has the same probability.)
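The derivation above can be sketched by simulation; the Uniform(0, 1) distribution and the specific $\alpha$ and $k$ below are arbitrary choices for illustration:

```python
# Sketch: Markov's inequality checked by simulation, then Chebyshev
# obtained by applying Markov to Y = (X - mu)^2 with threshold
# (k*sigma)^2. The Uniform(0, 1) distribution is an arbitrary choice.
import random

random.seed(1)
xs = [random.random() for _ in range(200_000)]  # Uniform(0, 1)

# Markov: P(X >= alpha) <= E[X] / alpha for nonnegative X
alpha = 0.9
e_x = sum(xs) / len(xs)
p_tail = sum(x >= alpha for x in xs) / len(xs)

# Chebyshev as Markov on Y = (X - mu)^2
mu, var = 0.5, 1 / 12            # mean and variance of Uniform(0, 1)
k = 1.5
ys = [(x - mu) ** 2 for x in xs]
p_dev = sum(y >= k ** 2 * var for y in ys) / len(ys)

print(p_tail <= e_x / alpha)     # Markov holds
print(p_dev <= 1 / k ** 2)       # Chebyshev holds
```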

Now the intuition behind Markov's inequality is that there is an implicit relationship between probability and expectation, and that for nonnegative random variables knowing the expected value places certain constraints on the behavior of the tail. That is, if one already knows how large $X$ is on average, then the probability of large values must be controlled or $\text{E}(X)$ will itself be "pulled" towards a larger value.

To illustrate, suppose that $\text{E}(X) = 1$. Is it possible that $P(X \geq 2) > 1/2$? Obviously not: since $X \geq 0$, that would force $\text{E}(X) \geq 2 \cdot P(X \geq 2) > 1$, contradicting $\text{E}(X) = 1$.

Answer:

Find the worst-case distribution; every other distribution must then have a smaller tail probability.

Since all that matters is whether a point is inside or outside the ball of radius $k \sigma$ centered at $\mu$, we should concentrate all of the probability mass inside the ball at $\mu$, and place everything outside the ball exactly on the boundary $|x - \mu| = k \sigma$; doing so minimizes the contribution of the outside mass to the variance, letting us place as much mass outside the ball as possible.

That is, we should consider the distribution

$$ P(X = x) = \begin{cases} 1 - \rho & x = \mu \\ \rho/2 & x = \mu \pm k \sigma \\ 0 & \text{otherwise} \end{cases} $$

where $\rho = P(|X-\mu| \geq k \sigma)$.

This has mean $\mu$ and standard deviation $k \sigma \sqrt{\rho}$; requiring the standard deviation to equal $\sigma$ forces $\rho = 1/k^2$.
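A quick numerical check of this worst-case distribution (the values of $\mu$, $\sigma$, and $k$ below are arbitrary):

```python
# Check: the two-point-plus-center distribution above has standard
# deviation k*sigma*sqrt(rho); forcing it to equal sigma pins rho
# at 1/k^2. The values mu, sigma, k are arbitrary.
import math

mu, sigma, k = 2.0, 1.5, 3.0
rho = 1 / k ** 2   # claimed worst-case tail mass

points = [mu, mu - k * sigma, mu + k * sigma]
masses = [1 - rho, rho / 2, rho / 2]

mean = sum(p * m for p, m in zip(points, masses))
sd = math.sqrt(sum((p - mean) ** 2 * m for p, m in zip(points, masses)))

print(math.isclose(mean, mu))                        # mean is mu
print(math.isclose(sd, k * sigma * math.sqrt(rho)))  # sd = k*sigma*sqrt(rho)
print(math.isclose(sd, sigma))                       # ...which equals sigma
```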

(Making this rigorous would require proving that this really is the worst case. Of course, once we know what the answer should be, it may be easier to prove the inequality directly.)