Proof of the Vapnik-Chervonenkis inequality: probability vs. analysis in the symmetrization-by-random-signs step.


I am trying to resolve some questions that arose while working through the proof of the Vapnik-Chervonenkis inequality, and would appreciate some assistance in clarifying a particular step. The proof comes from a set of lecture notes by Robert Nowak (2009), closely following a strategy similar to the one outlined by Devroye, Györfi and Lugosi (1996) in their proof of the Glivenko-Cantelli theorem.

How is the following inequality justified? An extract of the proof is supplied further below.

$$\begin{aligned} \mathbb{P} &\left( \sup_{f \in \mathcal{F}} \frac{1}{n} \left \vert \sum^n_{i=1} \sigma_i \left(\mathbb{I}(f(X_i) \neq Y_i) - \mathbb{I}(f(X_i') \neq Y_i')\right) \right \vert > \frac{\epsilon}{2} \right) \\ &\leq \mathbb{P} \left( \left\{ \sup_{f \in \mathcal{F}} \frac{1}{n} \left \vert \sum^n_{i=1} \sigma_i \mathbb{I}(f(X_i) \neq Y_i) \right \vert > \frac{\epsilon}{4} \right\} \cup \left\{ \sup_{f \in \mathcal{F}} \frac{1}{n} \left \vert \sum^n_{i=1} \sigma_i \mathbb{I}(f(X_i') \neq Y_i') \right \vert > \frac{\epsilon}{4} \right\} \right) \\ \end{aligned}$$

Here the $\sigma_i$ are i.i.d. Rademacher random variables with $\mathbb{P}(\sigma_i = +1) = \mathbb{P}(\sigma_i = -1) = 1/2$, and $(X_i, Y_i), (X_i', Y_i') \overset{i.i.d.}{\sim} P(X, Y)$, with $P$ an unknown joint distribution. We also have $X \in \mathbb{R}^d$ and $Y \in \{0, 1\}$, and each $f \in \mathcal{F}$ is a classifier $f: \mathbb{R}^d \rightarrow \{0, 1\}$.
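To make these objects concrete, here is a small numerical sketch of the symmetrized quantity on the left-hand side. The one-dimensional threshold family, the sample sizes, and all variable names are hypothetical stand-ins for a general class $\mathcal{F}$, not part of the original proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Ghost-sample setup: two independent samples of the same size,
# plus i.i.d. Rademacher signs sigma_i in {-1, +1}.
X, Xp = rng.normal(size=n), rng.normal(size=n)
Y, Yp = rng.integers(0, 2, size=n), rng.integers(0, 2, size=n)
sigma = rng.choice([-1, 1], size=n)

# Toy classifier family: threshold rules f_t(x) = 1{x > t}
# (a hypothetical stand-in for a general class F).
thresholds = np.linspace(-2, 2, 41)

def sup_symmetrized(thresholds, sigma, X, Y, Xp, Yp):
    """sup over f of (1/n)|sum_i sigma_i*(1{f(X_i)!=Y_i} - 1{f(X_i')!=Y_i'})|."""
    best = 0.0
    for t in thresholds:
        err  = ((X  > t).astype(int) != Y ).astype(float)
        errp = ((Xp > t).astype(int) != Yp).astype(float)
        best = max(best, abs(np.mean(sigma * (err - errp))))
    return best

print(sup_symmetrized(thresholds, sigma, X, Y, Xp, Yp))
```

Since each summand $\sigma_i(\mathbb{I}(\cdot) - \mathbb{I}(\cdot))$ lies in $\{-1, 0, 1\}$, the printed supremum is always between $0$ and $1$.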

Here is an extract of the proof:

[Proof extract supplied as an image; not reproduced here.]


Here is what I have tried. Working backwards (out of necessity rather than by choice), for the inequality to hold it suffices to establish the event inclusion

$$\begin{aligned} &\left\{\sup_{f \in \mathcal{F}} \frac{1}{n} \left \vert \sum^n_{i=1} \sigma_i \left(\mathbb{I}(f(X_i) \neq Y_i) - \mathbb{I}(f(X_i') \neq Y_i')\right) \right \vert > \frac{\epsilon}{2} \right\} \\ &\subseteq \left\{ \sup_{f \in \mathcal{F}} \frac{1}{n} \left \vert \sum^n_{i=1} \sigma_i \mathbb{I}(f(X_i) \neq Y_i) \right \vert > \frac{\epsilon}{4} \right\} \cup \left\{ \sup_{f \in \mathcal{F}} \frac{1}{n} \left \vert \sum^n_{i=1} \sigma_i \mathbb{I}(f(X_i') \neq Y_i') \right \vert > \frac{\epsilon}{4} \right\}. \end{aligned}$$

What I'm finding difficult at this stage is whether the next move is informed by analysis or by probability. On the former, the difficulty for me lies in how to "distribute" the $\sup$ and how to break up the summation on the left-hand side to reason about individual events. I was thinking that maybe I could strip away the stochasticity and make a deterministic argument by applying the triangle inequality and taking suprema. But that is a fragmentary thought; I have only recently been introduced to analysis, and I am not sure how to apply it in a probabilistic setting.
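As an empirical sanity check on the conjectured inclusion, one can verify it pointwise by simulation: on every draw, whenever the left-hand event occurs, at least one of the two right-hand events should occur. This sketch reuses a toy one-dimensional threshold family as a stand-in for $\mathcal{F}$; all names and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps = 50, 0.1
thresholds = np.linspace(-2, 2, 21)

def sup_abs(thresholds, sigma, X, Y, Xp=None, Yp=None):
    """sup_f (1/n)|sum_i sigma_i * term_i(f)| for the one- or two-sample form."""
    best = 0.0
    for t in thresholds:
        term = ((X > t).astype(int) != Y).astype(float)
        if Xp is not None:
            term = term - ((Xp > t).astype(int) != Yp).astype(float)
        best = max(best, abs(np.mean(sigma * term)))
    return best

violations = 0
for _ in range(2000):
    X, Xp = rng.normal(size=n), rng.normal(size=n)
    Y, Yp = rng.integers(0, 2, size=n), rng.integers(0, 2, size=n)
    sigma = rng.choice([-1, 1], size=n)
    left = sup_abs(thresholds, sigma, X, Y, Xp, Yp) > eps / 2
    if left and not (sup_abs(thresholds, sigma, X, Y) > eps / 4
                     or sup_abs(thresholds, sigma, Xp, Yp) > eps / 4):
        violations += 1

print(violations)  # always 0: the inclusion holds pointwise, not just on average
```

The count is deterministically zero because the inclusion is a per-$\omega$ fact, which is exactly the point of the analysis argument below.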



Best answer:

Your next move is informed by analysis. The assertion $E \subseteq G \cup H$ is equivalent to the assertion $G^c \cap H^c \subseteq E^c$, so to justify the claimed set inclusion it is enough to show that if $\omega \in G^c$ and $\omega \in H^c$, then $\omega \in E^c$.

In probability the $\omega$'s tend to be omitted from the notation for an event, but here it may be helpful to insert them back in: $\omega \in G^c$ is the same as saying
$$\sup_{f\in\mathcal{F}}\frac1n\Big|\sum^n_{i=1} \sigma_i(\omega)\,{\mathbb 1}(f(X_i(\omega))\ne Y_i(\omega))\Big|\le\frac\epsilon4,$$
which is now a statement involving $\sigma_1(\omega),\ldots,\sigma_n(\omega)$, $X_1(\omega),\ldots,X_n(\omega)$, and $Y_1(\omega),\ldots,Y_n(\omega)$, which are ordinary real-valued scalars or vectors.

Now apply the real-analysis result
$$\sup_f |A(f)-B(f)|\le \sup_f|A(f)|+\sup_f|B(f)|$$
(which follows from the triangle inequality) with
$$ A(f):=\frac1n\sum^n_{i=1}\sigma_i(\omega)\,{\mathbb 1}(f(X_i(\omega))\ne Y_i(\omega))$$
and
$$ B(f):=\frac1n\sum^n_{i=1}\sigma_i(\omega)\,{\mathbb 1}(f(X_i'(\omega))\ne Y_i'(\omega)).$$
If $\omega\in G^c\cap H^c$, then $\sup_f|A(f)|\le\epsilon/4$ and $\sup_f|B(f)|\le\epsilon/4$, so $\sup_f|A(f)-B(f)|\le\epsilon/4+\epsilon/4=\epsilon/2$, which is exactly the statement that $\omega\in E^c$.

In probability, this kind of argument is a common pattern for showing that one event is a subset of another. Once you get used to the implicit $\omega$, you will become comfortable manipulating random variables as if they were real numbers.
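The real-analysis fact above can be sanity-checked numerically on arbitrary vectors; the random arrays here are placeholders for the values $A(f)$ and $B(f)$ over a finite family of classifiers, since the bound is a pointwise (per-$\omega$) statement.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical values of A(f) and B(f) over a finite grid of classifiers f;
# any real vectors work, since the bound holds for every fixed omega.
A = rng.normal(size=100)
B = rng.normal(size=100)

lhs = np.max(np.abs(A - B))                  # sup_f |A(f) - B(f)|
rhs = np.max(np.abs(A)) + np.max(np.abs(B))  # sup_f |A(f)| + sup_f |B(f)|

print(lhs <= rhs)  # True for every draw, by the triangle inequality
```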