I am having difficulty with intuitive understanding of a definition in an exposition of the proof of the Vapnik Chervonenkis inequality in the notes of Robert Nowak (2009). The proof strategy is taken from the yellow book by Devroye, Györfi and Lugosi (1996).
Context.
Step 1: First symmetrization by a ghost sample.

We are going to show that $$P \left( \sup_{f \in \mathcal{F}} \Bigg \vert \hat{R}_n(f) - R(f) \Bigg \vert > \epsilon \right) \leq 2 P \left( \sup_{f \in \mathcal{F}} \Bigg \vert \hat{R}_n(f) - \hat{R}_n'(f) \Bigg \vert > \frac{\epsilon}{2} \right). $$ Notice that the absolute value term on the right hand side is now symmetric, involving two different empirical risks. Begin by defining $\tilde{f}(D_n) = \tilde{f}$ to be an element of $\mathcal{F}$ such that $\vert \hat{R}_n(\tilde{f}) - R(\tilde{f}) \vert > \epsilon$ if such an element exists; otherwise $\tilde{f}$ is an arbitrary element of $\mathcal{F}$. You can think of $\tilde{f}$ as $$\tilde{f} \approx \arg \max_{f \in \mathcal{F}} \Bigg \vert \hat{R}_n(f) - R(f) \Bigg \vert,$$ although this is generally not well defined: there might not be an element in $\mathcal{F}$ attaining the maximum (because $\mathcal{F}$ is typically infinite). The formal definition is an attempt to circumvent this technicality; it suffices for our purposes. Notice that $\tilde{f}$ is a function of $D_n$ (we dropped the explicit dependence to make the presentation cleaner).
On terms not defined in the extract:
- $\hat{R}_n(f) = \frac{1}{n} \sum^n_{i=1} \mathbb{I}(f(X_i) \neq Y_i)$ is the empirical risk under a training sample $D_n = (X_1, Y_1), \dots, (X_n, Y_n)$
- Similarly, $\hat{R}'_n(f)$ is the empirical risk under a ghost training sample $D_n' = (X'_1, Y'_1), \dots, (X'_n, Y'_n)$
- $R(f) = \mathbb{E}[l(f(X), Y)] = P(f(X) \neq Y)$ is the risk, where $l$ is the 0-1 loss $l(\hat{y}, y) = \mathbb{I}(\hat{y} \neq y)$.
- $\mathcal{F}$ is a function class of uncountable cardinality.
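To fix notation, here is a toy numerical sketch of the quantities above (my own illustration, not from the notes): empirical risks of threshold classifiers $f_t(x) = \mathbb{I}(x > t)$ under the 0-1 loss, on a training sample $D_n$ and an independent ghost sample $D_n'$, with the supremum over $\mathcal{F}$ replaced by a maximum over a finite grid of thresholds.

```python
import random

def empirical_risk(f, sample):
    """hat{R}_n(f): fraction of (x, y) pairs in `sample` misclassified by f."""
    return sum(f(x) != y for x, y in sample) / len(sample)

def draw_sample(n, rng):
    """Labels are the indicator of x > 0.5 with 10% label noise, so R(f_{0.5}) = 0.1."""
    sample = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.1:
            y = 1 - y  # flip the label with probability 0.1
        sample.append((x, y))
    return sample

rng = random.Random(0)
n = 2000
D_n = draw_sample(n, rng)      # training sample D_n
D_ghost = draw_sample(n, rng)  # ghost sample D_n': same distribution, independent

# Finite grid of thresholds standing in for the (uncountable) class F.
thresholds = [k / 20 for k in range(21)]
deviation = max(
    abs(empirical_risk(lambda x, t=t: int(x > t), D_n)
        - empirical_risk(lambda x, t=t: int(x > t), D_ghost))
    for t in thresholds
)
print(deviation)  # the symmetrized quantity max_t |hat R_n(f_t) - hat R_n'(f_t)|
```

Of course, the whole point of the VC argument is to control this deviation uniformly over the genuinely infinite class, not just over a finite grid; the code is only meant to pin down the notation.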
Questions.
I am comfortable with the idea that we formally define "$\tilde{f}$ as an element of $\mathcal{F}$ such that $\vert \hat{R}_n(\tilde{f}) - R(\tilde{f}) \vert > \epsilon$ if such an element exists; otherwise $\tilde{f}$ is an arbitrary element of $\mathcal{F}$."
But what does the author mean when he supplies the intuition that we "can think of $\tilde{f}$ as
$$\tilde{f} \approx \arg \max_{f \in \mathcal{F}} \Bigg \vert \hat{R}_n(f) - R(f) \Bigg \vert,$$"
together with the explanation that "there might not be an element in $\mathcal{F}$ attaining the maximum (because $\mathcal{F}$ is typically infinite)"?
In other words, how can we have a maximum of a quantity, but then have no argument that attains this maximum?
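To make the question concrete, here is the kind of situation I imagine is at issue (my own toy example, not from the notes): if, over a countable class $\mathcal{F} = \{f_1, f_2, \dots\}$, the deviations happened to satisfy $\vert \hat{R}_n(f_k) - R(f_k) \vert = 1 - \frac{1}{k}$, then
$$\sup_{f \in \mathcal{F}} \Bigg \vert \hat{R}_n(f) - R(f) \Bigg \vert = \sup_{k \geq 1} \left( 1 - \frac{1}{k} \right) = 1,$$
yet no single $f_k$ attains this value, so the $\arg \max$ would be empty. Is this the sort of obstruction the author has in mind?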
In the original notes, the extract I supplied actually reads
Begin by defining $\tilde{f}(D_n) = \tilde{f}$ to be an element of $\mathcal{F}$ such that $\vert \hat{R}_n(f) - R(f) \vert > \epsilon$ if such element exists...
but I think this is a typo: it should read $\tilde{f}$ in place of $f$, as I have corrected it in my write-up above.
This definition can be found, in slightly different but essentially the same language, on page 211 of the linked book (Theorem 12.4 in "A Probabilistic Theory of Pattern Recognition"), and a similar question on this point also appears to have gone unanswered.