Bounding the uniform deviation of the empirical risk from the risk over a finite function class.


I am having difficulty interpreting the following theorem from here as a probability statement:

Theorem.

For all $\delta$ such that $0 < \delta < 1/2$, with probability at least $1 - \delta$ the following holds: $$\lvert \hat{R}_n(f) - {R}(f) \rvert \leq \sqrt{\frac{\ln \lvert \mathcal{F} \rvert + \ln (2 / \delta)}{2n}} \quad \forall f \in \mathcal{F} \tag{4.3}$$

where $\hat{R}_n(f) = n^{-1} \sum^n_{i=1} \mathbb{I}(Y_i \neq f(X_i))$ is the empirical risk under the 0/1 loss, computed from $n$ training data points $(X_i, Y_i)$ drawn i.i.d. from an unknown distribution $p^*$, and $R(f) = \mathbb{E}_{(X, Y) \sim p^*}[\mathbb{I}(Y \neq f(X))]$ is the risk. Here $f$ is a linear classifier belonging to a function class $\mathcal{F}$ that is finite, i.e. $\lvert \mathcal{F} \rvert < \infty$.
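To make the quantities concrete, here is a minimal sketch of the empirical risk $\hat{R}_n(f)$ under 0/1 loss for a linear classifier. The data-generating setup, the dimensions, and the weight vector `w` are all illustrative assumptions, not from the theorem itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n labelled points and a linear classifier
# f(x) = sign(w . x); all values here are illustrative only.
n, d = 200, 2
X = rng.normal(size=(n, d))
Y = np.sign(X[:, 0] + 0.5 * rng.normal(size=n)).astype(int)

def empirical_risk(w, X, Y):
    """R_hat_n(f): fraction of training points misclassified (0/1 loss)."""
    preds = np.sign(X @ w).astype(int)
    return np.mean(preds != Y)

w = np.array([1.0, 0.0])
print(empirical_risk(w, X, Y))  # a value in [0, 1]
```

The risk $R(f)$ is the same quantity computed in expectation over $p^*$, which in practice is unknown; that gap is exactly what the theorem bounds.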

Query.

If I were to rewrite this as a probability statement, would the quantifier $\forall f \in \mathcal{F}$ go inside or outside the probability statement?

That is, should the above theorem read:

$$\mathbb{P} \left( \forall f \in \mathcal{F}, \ \lvert \hat{R}_n(f) - R(f) \rvert \leq \sqrt{\frac{\ln \lvert \mathcal{F} \rvert + \ln (2 / \delta)}{2n}}\right) \geq 1 - \delta \tag{A}$$

Or should it read:

$$\forall f \in \mathcal{F} , \mathbb{P} \left( \lvert \hat{R}_n(f) - R(f) \rvert \leq \sqrt{\frac{\ln \lvert \mathcal{F} \rvert + \ln (2 / \delta)}{2n}}\right) \geq 1 - \delta \tag{B}$$

Further context - proof of the theorem.

I will now detail each step of the proof, which is fairly standard in the statistics and statistical learning theory literature, together with my understanding.

Step 1 - Bound the deviation of the empirical risk from the risk for a single function $f \in \mathcal{F}$ using a concentration inequality.

For a single function $f \in \mathcal{F}$: since the empirical risk $\hat{R}_n(f)$ is the sample average of i.i.d. Bernoulli random variables and the risk $R(f)$ is their common expectation, Hoeffding's inequality gives

$$\mathbb{P} \left( \lvert \hat{R}_n(f) - R(f) \rvert \geq \epsilon \right) \leq 2\exp(-2n\epsilon^2)$$
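This concentration step can be checked numerically. The sketch below draws i.i.d. Bernoulli losses for a single fixed $f$ and compares the observed deviation frequency with the Hoeffding bound $2\exp(-2n\epsilon^2)$; the values of $R$, $n$, $\epsilon$, and the trial count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# For a fixed f, each I(Y_i != f(X_i)) is Bernoulli(R(f)).
# Compare the Monte Carlo deviation frequency with Hoeffding's bound.
R, n, eps, trials = 0.3, 100, 0.1, 20_000
losses = rng.random((trials, n)) < R          # Bernoulli(R) losses
R_hat = losses.mean(axis=1)                   # empirical risk per trial
freq = np.mean(np.abs(R_hat - R) >= eps)      # observed P(|R_hat - R| >= eps)
bound = 2 * np.exp(-2 * n * eps**2)
print(freq, bound)  # freq should sit below the bound
```

The bound is loose here (Hoeffding makes no use of the variance), but it holds uniformly in $R$, which is what the proof needs.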

Step 2 - Bound the uniform deviation using the union bound.

My understanding is that we require the empirical risk to be a good estimate of the risk in the sense of being 'uniformly close' to the risk over the entire function class $\mathcal{F}$. Hoeffding's inequality by itself is insufficient: for a particular $f \in \mathcal{F}$, there is a set of training samples $S_f$ on which $\lvert \hat{R}_n(f) - R(f) \rvert \leq \epsilon$, with probability $\mathbb{P}(S_f) \geq 1 - 2\exp(-2n\epsilon^2)$, and the crux is that in general this set $S_f$ is different for different functions in $\mathcal{F}$.

We define the event $E_f = \{\lvert \hat{R}_n(f) - R(f) \rvert \geq \epsilon\}$ and, as the function class $\mathcal{F}$ is finite, apply the union bound over the events $E_f$:

$$\begin{align} \mathbb{P} \left( \sup_{f \in \mathcal{F}} \lvert \hat{R}_n(f) - R(f) \rvert \geq \epsilon \right) &= \mathbb{P}\left( \bigcup_{f \in \mathcal{F}} \{ \lvert \hat{R}_n(f) - R(f) \rvert \geq \epsilon\} \right) \\ &\leq \sum_{f \in \mathcal{F}} \mathbb{P} \left(\lvert \hat{R}_n(f) - R(f) \rvert \geq \epsilon\right) \\ &\leq \sum_{f \in \mathcal{F}} 2 \exp(-2n \epsilon^2) \\ &= 2 \lvert \mathcal{F} \rvert \exp(-2n \epsilon^2) \end{align}$$

Setting $\delta := 2 \lvert \mathcal{F} \rvert \exp(-2n \epsilon^2)$ and rearranging for $\epsilon$ gives

$$\epsilon = \sqrt{\frac{\ln \lvert \mathcal{F} \rvert + \ln (2 / \delta)}{2n}}$$

and taking complements then yields

$$\mathbb{P} \left( \sup_{f \in \mathcal{F}} \lvert \hat{R}_n(f) - R(f) \rvert \leq \sqrt{\frac{\ln \lvert \mathcal{F} \rvert + \ln (2 / \delta)}{2n}} \right) \geq 1 - \delta$$
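The $\delta \leftrightarrow \epsilon$ rearrangement is easy to verify numerically. In this sketch the values of $\lvert \mathcal{F} \rvert$, $n$, and $\delta$ are arbitrary illustrative choices.

```python
import math

# The uniform-deviation radius from the theorem:
# eps = sqrt((ln|F| + ln(2/delta)) / (2n)).
def uniform_bound(card_F, n, delta):
    return math.sqrt((math.log(card_F) + math.log(2 / delta)) / (2 * n))

card_F, n, delta = 1000, 5000, 0.05
eps = uniform_bound(card_F, n, delta)

# Round trip: plugging eps back into 2|F| exp(-2 n eps^2) recovers delta.
delta_back = 2 * card_F * math.exp(-2 * n * eps**2)
print(eps, delta_back)  # delta_back equals delta up to floating point
```

Note how the radius grows only logarithmically in $\lvert \mathcal{F} \rvert$: doubling the class size costs an additive $\ln 2 / (2n)$ inside the square root.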

Up to here, I am comfortable with the proof steps, and I read the above line as: "the probability that the maximum absolute deviation between the empirical risk and the risk over the entire function class $\mathcal{F}$ is at most $C_{\delta}$ is at least $1 - \delta$", where $C_{\delta}$ denotes the square-root term on the right-hand side.

My confusion.

However, where it gets unfamiliar is that the authors state that this proves $(4.3)$, and I am having difficulty figuring out whether this means that

$$\begin{align} &\mathbb{P} \left( \sup_{f \in \mathcal{F}} \lvert \hat{R}_n(f) - R(f) \rvert \leq \sqrt{\frac{\ln \lvert \mathcal{F} \rvert + \ln (2 / \delta)}{2n}} \right) \geq 1 - \delta \\ & \implies \mathbb{P} \left( \forall f \in \mathcal{F}, \ \lvert \hat{R}_n(f) - R(f) \rvert \leq \sqrt{\frac{\ln \lvert \mathcal{F} \rvert + \ln (2 / \delta)}{2n}}\right) \geq 1 - \delta \end{align}$$

or whether it means that

$$\begin{align} &\mathbb{P} \left( \sup_{f \in \mathcal{F}} \lvert \hat{R}_n(f) - R(f) \rvert \leq \sqrt{\frac{\ln \lvert \mathcal{F} \rvert + \ln (2 / \delta)}{2n}} \right) \geq 1 - \delta \\ & \implies \forall f \in \mathcal{F} , \mathbb{P} \left( \lvert \hat{R}_n(f) - R(f) \rvert \leq \sqrt{\frac{\ln \lvert \mathcal{F} \rvert + \ln (2 / \delta)}{2n}}\right) \geq 1 - \delta \\ \end{align}$$

Answer.

The statement the authors intended is $(A)$, for the following reasons.

Addressing the confusion.

Much of your reluctance to consider $(A)$ as the correct statement comes down to overlooking the following basic distinction between:

$$\mathbb{P} \left( \sup_{f \in \mathcal{F}} \lvert \hat{R}_n(f) - R(f) \rvert \geq \epsilon \right) = \mathbb{P} \left( \bigcup_{f \in \mathcal{F}} \{\lvert \hat{R}_n(f) - R(f) \rvert \geq \epsilon \} \right)$$

and

$$\begin{align} \mathbb{P} \left( \sup_{f \in \mathcal{F}} \lvert \hat{R}_n(f) - R(f) \rvert \leq \epsilon \right) &= \mathbb{P} \left( \bigcap_{f \in \mathcal{F}} \{ \lvert \hat{R}_n(f) - R(f) \rvert \leq \epsilon \} \right) \\ &= \mathbb{P} \left( \forall f \in \mathcal{F}, \space \lvert \hat{R}_n(f) - R(f) \rvert \leq \epsilon \right) \end{align}$$

Consider the events within each statement.

  • In the first statement, the event that the "maximum absolute deviation between the empirical risk and the risk over the function class $\mathcal{F}$ exceeds $\epsilon$" is the same as the compound event that "there is at least one function $f$ whose absolute deviation between the empirical risk and risk exceeds $\epsilon$".
  • In the second statement, the event that the "maximum absolute deviation between the empirical risk and the risk over the function class $\mathcal{F}$ does not exceed $\epsilon$" is the same as the compound event that the "absolute deviation between the empirical risk and the risk does not exceed $\epsilon$ for all functions $f \in \mathcal{F}$".

Difference between statements $(A)$ and $(B)$.

The reason why the statement the authors intended is not statement $(B)$ goes back to the argument about the insufficiency of concentration inequalities for understanding generalisation error, and why we are instead interested in bounding uniform deviations. Statement $(B)$ is of the same flavour as a concentration inequality applied to every $f \in \mathcal{F}$, except it is much looser than the one derived using Hoeffding in step 1, due to the inclusion of the $\ln \lvert \mathcal{F} \rvert$ term:

$$\mathbb{P} \left( \lvert \hat{R}_n(f) - R(f) \rvert \leq \sqrt{\frac{\ln \lvert \mathcal{F} \rvert + \ln (2 / \delta)}{2n}} \right) \geq 1 - \delta, \quad \forall f \in \mathcal{F}$$

  • Setting the looseness of the bound aside, this is qualitatively distinct from statement $(A)$.
  • Statement $(B)$ says that for every single function $f \in \mathcal{F}$ there exists a set of training samples associated with that function $f$, which we define as $S_f$, on which $\lvert \hat{R}_n(f) - R(f) \rvert \leq \sqrt{\frac{\ln \lvert \mathcal{F} \rvert + \ln (2 / \delta)}{2n}}$, and that the probability of drawing such a sample is at least $1 - \delta$, that is, $\mathbb{P}(S_f) \geq 1 - \delta$.
  • Even though $\mathbb{P}(S_f) \geq 1 - \delta$ for all $f \in \mathcal{F}$, this does not change the fact that, in general, the set $S_f$ on which $\lvert \hat{R}_n(f) - R(f) \rvert \leq \sqrt{\frac{\ln \lvert \mathcal{F} \rvert + \ln (2 / \delta)}{2n}}$ holds is different for different functions $f \in \mathcal{F}$. Hence something stronger than this is required.
  • What is required instead is a single set of training samples $S$, with $\mathbb{P}(S) \geq 1 - \delta$, on which $\lvert \hat{R}_n(f) - R(f) \rvert \leq \sqrt{\frac{\ln \lvert \mathcal{F} \rvert + \ln (2 / \delta)}{2n}}$ holds simultaneously for every $f \in \mathcal{F}$. Only statement $(A)$ captures this sense.
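The distinction drawn in the bullets above can be illustrated with a toy simulation. Purely for illustration, the sketch below treats the failure events $E_f$ as independent across functions (in reality they are dependent through the shared training sample); that is enough to show that per-function coverage of $1-\delta$ does not imply simultaneous coverage of $1-\delta$. The class size $m$ and failure rate $q$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

# Each of m functions fails its pointwise bound with probability q = 0.05,
# independently here for illustration. Pointwise coverage is ~0.95 per f,
# but simultaneous ("uniform") coverage is much smaller: ~0.95**m.
m, q, trials = 20, 0.05, 50_000
fails = rng.random((trials, m)) < q           # True = event E_f occurred
pointwise = 1 - fails.mean(axis=0)            # per-f coverage, each ~ 1 - q
uniform = np.mean(~fails.any(axis=1))         # P(all f within the bound)
print(pointwise.min(), uniform)
```

This is exactly why the proof pays the $\ln \lvert \mathcal{F} \rvert$ price via the union bound: it buys the intersection event, not just each marginal event.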

Key takeaway: the position of the quantifier $\forall f \in \mathcal{F}$ is of the utmost importance.

Other sources.

Statements equivalent to $(A)$, though not written explicitly as probability statements and with minor notational differences, can be found in:

  • Anthony, M., & Bartlett, P. L. (1999). Neural Network Learning: Theoretical Foundations. Cambridge: Cambridge University Press, p. 22. doi:10.1017/CBO9780511624216
  • Bousquet, O., Boucheron, S., & Lugosi, G. (2004). Introduction to Statistical Learning Theory. In: Bousquet, O., von Luxburg, U., & Rätsch, G. (eds), Advanced Lectures on Machine Learning. ML 2003. Lecture Notes in Computer Science, vol. 3176. Springer, Berlin, Heidelberg, p. 179. https://doi.org/10.1007/978-3-540-28650-9_8