Why is uniform convergence in probability needed?


In statistical terms, learning a classifier can be defined as minimizing the risk $R(f) = E_D[l(f(x),y)]$. But since we do not know the distribution $D$, we instead minimize the empirical risk $R_n(f) = \frac{1}{n}\sum_{i=1}^{n}l(f(x_i),y_i)$. Let $f^\ast_n$ be a minimizer of the empirical risk $R_n(f)$ and $f^\ast$ a minimizer of the true risk $R(f)$.
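To make the definitions concrete, here is a small sketch of $R_n(f)$ with the 0-1 loss. The threshold classifier and the data-generating rule are hypothetical choices for illustration only; the question does not fix a hypothesis class.

```python
import numpy as np

# Hypothetical threshold classifier f(x) = 1[x > t] (an illustrative choice,
# not from the question).
def f(x, t=0.5):
    return (x > t).astype(int)

def zero_one_loss(y_pred, y_true):
    return (y_pred != y_true).astype(float)

def empirical_risk(x, y, t=0.5):
    # R_n(f) = (1/n) * sum_i l(f(x_i), y_i)
    return zero_one_loss(f(x, t), y).mean()

rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(size=n)
y = (x > 0.5).astype(int)           # labels generated by the t = 0.5 rule
print(empirical_risk(x, y, t=0.5))  # zero: this f reproduces the labels
print(empirical_risk(x, y, t=0.9))  # positive: a mismatched threshold
```

Minimizing `empirical_risk` over `t` would give the $f^\ast_n$ of the question; minimizing the expectation under $D$ would give $f^\ast$.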

For consistency of learning, i.e. $R(f^\ast_n)\mbox{ converges to }R(f^*)$, uniform convergence in probability of $R_n(f)\mbox{ to }R(f)$ is a necessary and sufficient condition. It means that for every $\epsilon, \delta > 0$, $\exists N(\epsilon, \delta) \mbox{ s.t. } P(\sup_{f\in\mathcal{F}}|R_n(f) - R(f)| > \epsilon)<\delta, \; \forall{n >N}$ (here $N$ is independent of $f$).

To see why it is a sufficient condition for consistency, we can decompose the excess risk:

$\begin{align}R(f^\ast_n) - R(f^*) &= R(f^*_n) - R_n(f^*_n) \\&+ R_n(f^*_n) - R_n(f^*) \\&+ R_n(f^*) - R(f^*)\end{align} $

and

$R_n(f^*_n) - R_n(f^*)\le 0$ (since $f^*_n$ is a minimizer of the empirical risk), so

$ \begin{align}R(f^\ast_n) - R(f^*) &\le R(f^*_n) - R_n(f^*_n) + R_n(f^*) - R(f^*)\end{align} $

By the triangle inequality, we can write $\begin{align}|R(f^\ast_n) - R(f^*)| &\le |R(f^*_n) - R_n(f^*_n)| + |R_n(f^*) - R(f^*)|\end{align}$

Now let $A$ be the event $|R(f^\ast_n) - R(f^*)| \le \epsilon$, $B$ the event $|R(f^*_n) - R_n(f^*_n)|\le \epsilon/2$, and $C$ the event $|R_n(f^*) - R(f^*)|\le \epsilon/2$. By the inequality above, $A \supset B \cap C$, hence $A^c \subset B^c \cup C^c$ and $P(A^c)\le P(B^c \cup C^c) \le P(B^c) + P(C^c)$.

therefore,

$\begin{align}P(|R(f^\ast_n) - R(f^*)| \ge \epsilon) &\le P(|R(f^*_n) - R_n(f^*_n)| \ge \epsilon/2) + P(|R_n(f^*) - R(f^*)|\ge \epsilon/2)\end{align}$

My question: from Hoeffding's inequality we can bound $P(|R_n(f) - R(f)| > \epsilon/2) < \delta/2$ for any $\delta$, with $\delta/2 = 2\exp(-n\epsilon^2/2)$, provided $l$ takes values in $[0,1]$. So why do we need uniform convergence in probability for consistency?
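For a single fixed $f$, the Hoeffding bound above is easy to check by simulation. The sketch below draws i.i.d. 0-1 losses with $R(f) = p$ and compares the observed deviation frequency against $2\exp(-2n(\epsilon/2)^2)$; the values of $p$, $n$, $\epsilon$ are illustrative choices, not from the question.

```python
import numpy as np

# Monte Carlo check of Hoeffding's bound for one fixed f: with 0-1 losses
# l_i ~ Bernoulli(p), we have R(f) = p and R_n(f) = sample mean of the l_i.
rng = np.random.default_rng(1)
p, n, eps = 0.3, 200, 0.2     # illustrative parameters
trials = 20000

losses = rng.binomial(1, p, size=(trials, n))
deviations = np.abs(losses.mean(axis=1) - p)

empirical = (deviations > eps / 2).mean()          # observed P(|R_n - R| > eps/2)
hoeffding = 2 * np.exp(-2 * n * (eps / 2) ** 2)    # Hoeffding bound at eps/2
print(empirical, hoeffding)                        # empirical <= hoeffding
```

The observed frequency stays below the bound, as it must for a fixed $f$; the subtlety in the question is whether the same reasoning applies to the data-dependent $f^\ast_n$.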

1 Answer

I realized why uniform convergence is important. Let $D$ be defined on a set/space $S$. The fallacy in the argument above is that the inequalities must all hold on a common event set.

In the last equation,

$P(|R(f^\ast_n) - R(f^*)| \ge \epsilon) \le P(|R(f^*_n) - R_n(f^*_n)|>\epsilon/2) + P(|R_n(f^*) - R(f^*)|>\epsilon/2)$ needs to hold up to probability measure $\delta$. Now suppose $P(|R(f^*_n) - R_n(f^*_n)|>\epsilon/2) < \delta/2$ holds except on an event set $A_1$ of measure less than $\delta/2$, and likewise $P(|R_n(f^*) - R(f^*)|>\epsilon/2)< \delta/2$ holds except on an event set $A_2$. Since $A_1$ and $A_2$ need not coincide, we can find a set $(S - A_1)\cap A_2$ on which the combined inequality may fail: there the second term does not satisfy $P(|R_n(f^*) - R(f^*)|>\epsilon/2)< \delta/2$, and the error may exceed $\epsilon/2$, or $\epsilon$ overall. The measure of the set $(S - A_1)\cap A_2$ can go up to $1- \delta/2$. The underlying issue is that $f^*_n$ depends on the sample, so the bad event for the first term varies with the realized $f^*_n$; controlling it requires a bound holding simultaneously for all $f \in \mathcal{F}$, which is exactly what uniform convergence provides.
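The failure of the pointwise Hoeffding argument for the data-dependent $f^\ast_n$ can be seen in a standard overfitting toy experiment (my own illustrative setup, not from the question): take many "classifiers" whose predictions are pure noise, so every fixed one has true risk exactly $1/2$, and look at the empirical risk of the empirical minimizer.

```python
import numpy as np

# N classifiers predicting independent coin flips: each fixed f has
# R(f) = 0.5 exactly, yet the empirical minimizer looks far better.
rng = np.random.default_rng(2)
n, N = 50, 10000              # illustrative sample size and class size

y = rng.integers(0, 2, size=n)            # labels
preds = rng.integers(0, 2, size=(N, n))   # N random classifiers
emp_risks = (preds != y).mean(axis=1)     # R_n(f) for each f

best = emp_risks.min()                    # R_n(f*_n)
true_risk = 0.5                           # R(f) = 0.5 for every single f
print(best, abs(best - true_risk))
# For any one fixed f, a large deviation |R_n(f) - 0.5| is exponentially
# unlikely by Hoeffding, but the minimum over N functions deviates badly.
```

Each fixed $f$ obeys Hoeffding, yet $|R_n(f^*_n) - R(f^*_n)|$ is large with high probability, because the bad events of different $f$'s accumulate; only a uniform bound over $\mathcal{F}$ rules this out.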