In Lecture 3 on Concentration Inequalities in Philippe Rigollet's Mathematics of Machine Learning course on MIT OCW, there is the following theorem (the theorem on the first page of this document). Much of the setup I include below is not part of the original theorem statement, since it was introduced in the previous lecture.
Let $H = \{h_1,...,h_M\}$ be a finite set of binary classifiers.
Let $\{(X_1,Y_1),...,(X_n,Y_n)\}$ be a set of $n$ independent draws from a joint distribution $P_{X,Y}$.
Let $R(h) = P(h(X) \neq Y)$, and let $\hat{R}(h) = \frac{1}{n}\sum_{i=1}^n{1(h(X_i) \neq Y_i)}$ be an estimator of $R(h)$.
Let $\hat{h} \in \arg \min_{h\in H}{\hat{R}(h)}$ be called the empirical risk minimizer, and let $\bar{h} \in \arg \min_{h\in H}{R(h)}$ be called the oracle.
Then the empirical risk minimizer $\hat{h}$ satisfies $R(\hat{h}) \leq R(\bar{h}) + \sqrt{\frac{2\log{\frac{2M}{\delta}}}{n}}$ with probability at least $1-\delta$.
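(Not part of the lecture, but a small Monte Carlo sanity check of this bound is easy to run. The toy distribution and the threshold class below are my own assumptions: $X\sim\mathrm{Unif}(0,1)$, $Y = 1(X>0.5)$ flipped with probability `noise`, and $H$ a grid of $M$ threshold classifiers $h_t(x)=1(x>t)$, for which $R(h_t)$ is available in closed form.)

```python
import math
import random

random.seed(0)

# Toy setup (my own assumption, not from the lecture): X ~ Unif(0,1),
# Y = 1(X > 0.5) flipped with probability `noise`, and H is a grid of
# M threshold classifiers h_t(x) = 1(x > t).
M, n, delta, noise = 50, 500, 0.05, 0.1
thresholds = [j / M for j in range(M)]

def true_risk(t):
    # Analytic risk: R(h_t) = noise + (1 - 2*noise) * P(h_t(X) != 1(X > 0.5))
    return noise + (1 - 2 * noise) * abs(t - 0.5)

oracle_risk = min(true_risk(t) for t in thresholds)   # R(h_bar)
bound = math.sqrt(2 * math.log(2 * M / delta) / n)    # sqrt(2 log(2M/delta) / n)

trials, violations = 100, 0
for _ in range(trials):
    xs = [random.random() for _ in range(n)]
    ys = [(x > 0.5) != (random.random() < noise) for x in xs]
    # Empirical risk of each h_t; h_hat is the empirical risk minimizer.
    emp = [sum((x > t) != y for x, y in zip(xs, ys)) / n for t in thresholds]
    t_hat = thresholds[emp.index(min(emp))]
    if true_risk(t_hat) > oracle_risk + bound:
        violations += 1

print(f"bound violated in {violations}/{trials} trials (should be at most ~{delta:.0%})")
```

In practice the bound is quite loose here, so violations should be rare even compared to $\delta$.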
The proof of this part of the theorem is easy for me to understand.
Since $\hat{R}(\hat{h}) \leq \hat{R}(\bar{h})$ by definition, we have that
$R(\hat{h}) \leq R(\bar{h}) + [\hat{R}(\bar{h}) - R(\bar{h})] + [R(\hat{h}) - \hat{R}(\hat{h})]$ (Inequality 1)
The last two terms can each be bounded using a previous theorem (which is an application of Hoeffding's Theorem) which states that
$P(\text{for all } h \in H\text{ : } |\hat{R}(h) - R(h)| \leq \sqrt{\frac{\log{\frac{2M}{\delta}}}{2n}}) = P(\max_{h \in H}{|\hat{R}(h) - R(h)|} \leq \sqrt{\frac{\log{\frac{2M}{\delta}}}{2n}}) \geq 1 - \delta$
Hence the conclusion.
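(Spelling out that last step, with $\varepsilon = \sqrt{\frac{\log(2M/\delta)}{2n}}$ denoting the uniform deviation bound: on the event above,)

```latex
R(\hat{h})
\;\leq\; \hat{R}(\hat{h}) + \varepsilon
\;\leq\; \hat{R}(\bar{h}) + \varepsilon
\;\leq\; R(\bar{h}) + 2\varepsilon
\;=\; R(\bar{h}) + \sqrt{\frac{2\log\frac{2M}{\delta}}{n}}.
```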
What I don't understand is why it is also the case that $E[R(\hat{h})] \leq R(\bar{h}) + \sqrt{\frac{2\log{2M}}{n}}$
The proof of this second part in the lecture is as follows:
If we show $E[\max_{h \in H}{|\hat{R}(h) - R(h)|}] \leq \sqrt{\frac{\log{2M}}{2n}}$ , then by Inequality 1 we're done.
Let $Z_j = \hat{R}(h_j) - R(h_j)$ for $j \in \{1,...,M\}$
We have:
$E[\max_j{Z_j}] = \frac{1}{s}\log\exp{E[s\max_j{Z_j}]} \leq \frac{1}{s}\log E[\exp{s\max_j{Z_j}}]$ by Jensen's Inequality.
Then we have,
$\frac{1}{s}\log E[\exp{s\max_j{Z_j}}] \leq \frac{1}{s}\log \sum_{j=1}^{2M}{E[\exp{sZ_j}]}$
The first thing I'm confused about is that I think instead of summing over $1 \leq j \leq 2M$, it should be over $1 \leq j \leq M$, since there are only $M$ $Z_j$'s; but when you do that, you no longer have $\sqrt{\frac{\log{2M}}{2n}}$ as the upper bound — instead you (should) get $\sqrt{\frac{\log{M}}{2n}}$.
The second thing I'm confused about is that the next step is: $\frac{1}{s}\log\sum_{j=1}^{2M}{E[\exp{sZ_j}]} \leq \frac{1}{s}\log{(2M\exp{(\frac{s^2}{8n})})}$
This supposedly follows from Hoeffding's Lemma, which states that if $Z$ is a random variable with expected value $0$ that is bounded above by a real number $b$ and bounded below by a real number $a$, then for all $s > 0$,
$E[e^{sZ}] \leq \exp\left(\frac{(b-a)^2s^2}{8}\right)$.
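(A quick numerical check of Hoeffding's Lemma — my own toy check, not from the lecture — for a centered Bernoulli variable $Z = B - p$, which lies in $[-p, 1-p]$, so $b - a = 1$ and the bound is $\exp(s^2/8)$. Its MGF is available in closed form, so the comparison is exact:)

```python
import math

def mgf_centered_bernoulli(p, s):
    # E[exp(s*Z)] for Z = B - p, B ~ Bernoulli(p); Z lies in [-p, 1-p], so b - a = 1
    return p * math.exp(s * (1 - p)) + (1 - p) * math.exp(-s * p)

for p in (0.1, 0.3, 0.5, 0.9):
    for s in (0.25, 1.0, 4.0):
        # Hoeffding's Lemma: E[exp(s*Z)] <= exp((b-a)^2 * s^2 / 8) = exp(s^2 / 8)
        assert mgf_centered_bernoulli(p, s) <= math.exp(s**2 / 8)
print("Hoeffding's Lemma bound holds on the grid")
```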
But as far as I can tell, for all $Z_j$, $b-a = 1$, not $\frac{1}{\sqrt{n}}$, which I think is the only way you can accomplish this step.
Thanks in advance.
For the first point, I do not know why it is done like that, because I think it is fine to say that $$\mathbf{E}[\exp{(s \underset{j}{\max}Z_j)}]=\mathbf{E}[\underset{j}{\max}\exp{(s Z_j)}]\leq \sum_{j=1}^{M}\mathbf{E}[\exp{(s Z_j)}].$$ (One possible explanation for the $2M$: the quantity the lecture actually needs is $\mathbf{E}[\max_j |Z_j|]$, and $\max_j |Z_j| = \max\{Z_1,\dots,Z_M,-Z_1,\dots,-Z_M\}$ is a maximum over $2M$ mean-zero variables, each of which satisfies the same MGF bound.)
For the second point, you need to realize the following equality
$$Z=\hat{R}(h)-R(h)=\frac{1}{n}\sum_{i=1}^{n}\left[1(h(X_{i})\neq Y_i)-\mathbf{E}[1(h(X_i)\neq Y_i)]\right]$$
and now you need to bound $\mathbf{E} [\exp{(sZ)}]$. For this purpose, you can use the bounded differences inequality (see *Concentration Inequalities* by Boucheron, Massart, and Lugosi, Chapter 6) to get that $$\mathbf{E} [\exp{(sZ)}]\leq \exp\left(\frac{s^{2}}{8n}\right).$$
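Equivalently, since the $n$ draws are independent, you can apply Hoeffding's Lemma to each factor separately: after the $\frac{1}{n}$ scaling, each centered summand lies in an interval of length $\frac{1}{n}$, so the relevant $b-a$ is $\frac{1}{n}$ per term rather than $1$:

```latex
\mathbf{E}\left[e^{sZ}\right]
= \prod_{i=1}^{n}\mathbf{E}\left[\exp\!\Big(\tfrac{s}{n}\big(1(h(X_i)\neq Y_i)-\mathbf{E}[1(h(X_i)\neq Y_i)]\big)\Big)\right]
\leq \prod_{i=1}^{n}\exp\!\left(\frac{s^{2}/n^{2}}{8}\right)
= \exp\!\left(\frac{s^{2}}{8n}\right).
```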
This inequality implies the final statement of your post. To finish the proof, you can optimize the right-hand side of the inequality with respect to $s$, and you will find the final result.
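Concretely, the chain above gives $\mathbf{E}[\max_j Z_j] \leq \frac{\log(2M)}{s} + \frac{s}{8n}$ for every $s>0$, and minimizing over $s$:

```latex
\min_{s>0}\left\{\frac{\log(2M)}{s}+\frac{s}{8n}\right\}
\;\text{is attained at}\; s^{*}=\sqrt{8n\log(2M)},
\;\text{giving}\quad
\mathbf{E}\Big[\max_{j}Z_j\Big]\leq 2\sqrt{\frac{\log(2M)}{8n}}=\sqrt{\frac{\log(2M)}{2n}}.
```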
I hope this answers your question and that it is clear.