Is Jensen's inequality being used in the conclusion of the proof of the Vapnik-Chervonenkis inequality?


I am trying to resolve some questions that arose while working through the proof of the Vapnik-Chervonenkis inequality, and would appreciate some help clarifying a particular step. The proof comes from a set of lecture notes by Robert Nowak (2009), which closely follows the strategy outlined by Devroye, Györfi and Lugosi (1996) in their proof of the Glivenko-Cantelli theorem.

How is the following inequality justified? (An extract of the proof is supplied further below.)

$$\begin{aligned} \mathbb{E}_{D_n} \Bigg[ s(\mathcal{F}, n) \, \sup_{f \in \mathcal{F}} \,& \mathbb{P}_{\sigma_1, \, \dots \, ,\, \sigma_n} \left( \frac{1}{n} \left \vert \sum^n_{i=1} \sigma_i \mathbb{I}(f(X_i) \neq Y_i) \right \vert > \frac{\epsilon}{4} \, \middle | \, D_n \right) \Bigg] \\ &\leq s(\mathcal{F}, n) \, \mathbb{E}_{D_n} \left[\sup_{f \in \mathcal{F}} \, \mathbb{P}_{\sigma_1, \, \dots \, ,\, \sigma_n} \left( \frac{1}{n} \left \vert \sum^n_{i=1} \sigma_i \mathbb{I}(f(X_i) \neq Y_i) \right \vert > \frac{\epsilon}{4} \, \middle | \, D_n \right) \right]. \\ \end{aligned}$$

To supply context on the terms not defined above:

  • $\sigma_i$ are i.i.d. Rademacher random variables taking values $\pm 1$ equiprobably, $\mathbb{P}(\sigma_i = +1) = \mathbb{P}(\sigma_i = -1) = 1/2$.
  • $D_n$ is a training sample consisting of $n$ i.i.d. observations $(X_1, Y_1), \dots, (X_n, Y_n) \sim P$ drawn from an unknown joint distribution $P$.
  • $f: \mathbb{R}^d \rightarrow \{0, 1\}$ is a binary classifier belonging to a function class, $f \in \mathcal{F}$.
  • $s(\mathcal{F}, n)$ is the $n$-th shatter coefficient, or growth function, of the function class $\mathcal{F}$, which is the maximum number of labelling sequences the class $\mathcal{F}$ induces over $n$ training points $x_1, \dots, x_n \in \mathbb{R}^d$.
  • That is, $s(\mathcal{F}, n) = \max_{x_1, \, \dots \,, \, x_n \in \, \mathbb{R}^d} \vert N_{\mathcal{F}}(x_1, \dots, x_n) \vert$, where $N_\mathcal{F}(x_1, \dots, x_n) = \{(f(x_1), \dots, f(x_n)) \in \{0, 1\}^n : f \in \mathcal{F}\}$.
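For concreteness, here is a toy check of the growth function (Python; the helper name `labelings` is my own) for the class of one-dimensional threshold classifiers $f_t(x) = \mathbb{I}(x \ge t)$, for which $s(\mathcal{F}, n) = n + 1$:

```python
def labelings(points):
    """Enumerate the distinct label vectors induced on `points` by the
    threshold classifiers f_t(x) = 1 if x >= t else 0."""
    seen = set()
    xs = sorted(points)
    # Thresholds at each point, plus one below all points and one above,
    # realize every behaviour of the class on this sample.
    candidates = [xs[0] - 1.0] + xs + [xs[-1] + 1.0]
    for t in candidates:
        seen.add(tuple(1 if x >= t else 0 for x in points))
    return seen

points = [0.3, 1.7, 2.2, 4.0]
print(len(labelings(points)))  # 5, i.e. n + 1 labelings for n = 4 points
```

So for this class only $n + 1$ of the $2^n$ possible label vectors are realized, which is the kind of polynomial growth the shatter coefficient captures.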

Here is an extract of the proof:

[Image of the proof extract, not reproduced here.]


Some fragmentary thoughts: initially I thought this step relied on Jensen's inequality. If that were the case, however, one would have to make some kind of convexity argument, perhaps using properties of the growth function/shatter coefficient $s(\mathcal{F}, n)$, and it is not clear to me how to do this, or whether it is even the right approach.

Given that there are a number of other typos in this proof, it might also be the case that this is a mistake and the "$\leq$" should read "$=$": for fixed $n$, the shatter coefficient $s(\mathcal{F}, n)$ does not depend on the sample $D_n$, so it should factor out of the expectation. However, I am not sure.
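To probe that suspicion, a quick numerical sanity check (Python, with a toy random variable `Z` of my own standing in for the supremum term, and a constant `c` standing in for $s(\mathcal{F}, n)$): if the multiplier is deterministic, it commutes with the expectation and the two sides coincide.

```python
import math
import random

random.seed(0)

# Toy stand-in, not the actual VC quantities: each draw plays the role
# of the random supremum inside the expectation, while c plays the role
# of s(F, n), which is a fixed constant once n is fixed (it does not
# depend on the sample D_n).
c = 16.0
draws = [random.random() for _ in range(100_000)]

lhs = sum(c * z for z in draws) / len(draws)  # estimates E[c * Z]
rhs = c * sum(draws) / len(draws)             # estimates c * E[Z]
print(math.isclose(lhs, rhs, rel_tol=1e-9))   # True
```

This only confirms the elementary identity $\mathbb{E}[cZ] = c\,\mathbb{E}[Z]$ for deterministic $c$; whether that is all the proof is using at this step is exactly my question.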