In Lecture 3 on Concentration Inequalities in Philippe Rigollet's Mathematics of Machine Learning course on MIT OCW, there is the following theorem (the theorem on the first page of this document). Much of the setup I include below is not part of the original theorem statement, since it was introduced in the previous lecture.
Let $H = \{h_1,...,h_M\}$ be a finite set of binary classifiers.
Let $\{(X_1,Y_1),...,(X_n,Y_n)\}$ be a set of $n$ independent draws from a joint distribution $P_{X,Y}$.
Let $R(h) = P(h(X) \neq Y)$, and let $\hat{R}(h) = \frac{1}{n}\sum_{i=1}^n{1(h(X_i) \neq Y_i)}$ be an estimator of $R(h)$.
Let $\hat{h} \in \arg \min_{h\in H}{\hat{R}(h)}$ be called the empirical risk minimizer, and let $\bar{h} \in \arg \min_{h\in H}{R(h)}$ be called the oracle.
Then the empirical risk minimizer $\hat{h}$ satisfies $R(\hat{h}) \leq R(\bar{h}) + \sqrt{\frac{2\log{\frac{2M}{\delta}}}{n}}$ with probability at least $1-\delta$.
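(Not part of the lecture, but a small Monte Carlo sanity check of this bound is easy to run. The toy distribution and the threshold class below are my own assumptions: $X\sim\mathrm{Unif}(0,1)$, $Y = 1(X>0.5)$ flipped with probability `noise`, and $H$ a grid of $M$ threshold classifiers $h_t(x)=1(x>t)$, for which $R(h_t)$ is available in closed form.)

```python
import math
import random

random.seed(0)

# Toy setup (my own assumption, not from the lecture): X ~ Unif(0,1),
# Y = 1(X > 0.5) flipped with probability `noise`, and H is a grid of
# M threshold classifiers h_t(x) = 1(x > t).
M, n, delta, noise = 50, 500, 0.05, 0.1
thresholds = [j / M for j in range(M)]

def true_risk(t):
    # Analytic risk: R(h_t) = noise + (1 - 2*noise) * P(h_t(X) != 1(X > 0.5))
    return noise + (1 - 2 * noise) * abs(t - 0.5)

oracle_risk = min(true_risk(t) for t in thresholds)   # R(h_bar)
bound = math.sqrt(2 * math.log(2 * M / delta) / n)    # sqrt(2 log(2M/delta) / n)

trials, violations = 100, 0
for _ in range(trials):
    xs = [random.random() for _ in range(n)]
    ys = [(x > 0.5) != (random.random() < noise) for x in xs]
    # Empirical risk of each h_t; h_hat is the empirical risk minimizer.
    emp = [sum((x > t) != y for x, y in zip(xs, ys)) / n for t in thresholds]
    t_hat = thresholds[emp.index(min(emp))]
    if true_risk(t_hat) > oracle_risk + bound:
        violations += 1

print(f"bound violated in {violations}/{trials} trials (should be at most ~{delta:.0%})")
```

In practice the bound is quite loose here, so violations should be rare even compared to $\delta$.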
The proof of this part of the theorem is easy for me to understand.
Since $\hat{R}(\hat{h}) \leq \hat{R}(\bar{h})$ by definition, we have that
$R(\hat{h}) \leq R(\bar{h}) + [\hat{R}(\bar{h}) - R(\bar{h})] + [R(\hat{h}) - \hat{R}(\hat{h})]$ (Inequality 1)
The last two terms can each be bounded using a previous theorem (which is an application of Hoeffding's Theorem) which states that
$P(\text{for all } h \in H\text{ : } |\hat{R}(h) - R(h)| \leq \sqrt{\frac{\log{\frac{2M}{\delta}}}{2n}}) = P(\max_{h \in H}{|\hat{R}(h) - R(h)|} \leq \sqrt{\frac{\log{\frac{2M}{\delta}}}{2n}}) \geq 1 - \delta$
Hence the conclusion.
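(Spelling out that last step, with $\varepsilon = \sqrt{\frac{\log(2M/\delta)}{2n}}$ denoting the uniform deviation bound: on the event above,)

```latex
R(\hat{h})
\;\leq\; \hat{R}(\hat{h}) + \varepsilon
\;\leq\; \hat{R}(\bar{h}) + \varepsilon
\;\leq\; R(\bar{h}) + 2\varepsilon
\;=\; R(\bar{h}) + \sqrt{\frac{2\log\frac{2M}{\delta}}{n}}.
```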
What I don't understand is why it is also the case that $E[R(\hat{h})] \leq R(\bar{h}) + \sqrt{\frac{2\log{2M}}{n}}$
The proof of this second part in the lecture is as follows:
If we show $E[\max_{h \in H}{|\hat{R}(h) - R(h)|}] \leq \sqrt{\frac{\log{2M}}{2n}}$ , then by Inequality 1 we're done.
Let $Z_j = \hat{R}(h_j) - R(h_j)$ for $j \in \{1,...,M\}$
We have:
$E[\max_j{Z_j}] = \frac{1}{s}\log\exp{E[s\max_j{Z_j}]} \leq \frac{1}{s}\log E[\exp{s\max_j{Z_j}}]$ by Jensen's Inequality.
Then we have,
$\frac{1}{s}\log E[\exp{s\max_j{Z_j}}] \leq \frac{1}{s}\log \sum_{j=1}^{2M}{E[\exp{sZ_j}]}$
The first thing I'm confused about is that I think instead of summing over $1 \leq j \leq 2M$, it should be over $1 \leq j \leq M$, since there are only $M$ $Z_j$'s; but when you do that, you no longer have $\sqrt{\frac{\log{2M}}{2n}}$ as the upper bound — instead you (should) get $\sqrt{\frac{\log{M}}{2n}}$.
The second thing I'm confused about is that the next step is: $\frac{1}{s}\log\sum_{j=1}^{2M}{E[\exp{sZ_j}]} \leq \frac{1}{s}\log{(2M\exp{(\frac{s^2}{8n})})}$
This supposedly follows from Hoeffding's Lemma, which states that if $Z$ is a random variable with expected value $0$ that is bounded above by a real number $b$ and bounded below by a real number $a$, then for all $s > 0$,
$E[e^{sZ}] \leq \exp\left(\frac{(b-a)^2s^2}{8}\right)$.
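(A quick numerical check of Hoeffding's Lemma — my own toy check, not from the lecture — for a centered Bernoulli variable $Z = B - p$, which lies in $[-p, 1-p]$, so $b - a = 1$ and the bound is $\exp(s^2/8)$. Its MGF is available in closed form, so the comparison is exact:)

```python
import math

def mgf_centered_bernoulli(p, s):
    # E[exp(s*Z)] for Z = B - p, B ~ Bernoulli(p); Z lies in [-p, 1-p], so b - a = 1
    return p * math.exp(s * (1 - p)) + (1 - p) * math.exp(-s * p)

for p in (0.1, 0.3, 0.5, 0.9):
    for s in (0.25, 1.0, 4.0):
        # Hoeffding's Lemma: E[exp(s*Z)] <= exp((b-a)^2 * s^2 / 8) = exp(s^2 / 8)
        assert mgf_centered_bernoulli(p, s) <= math.exp(s**2 / 8)
print("Hoeffding's Lemma bound holds on the grid")
```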
But as far as I can tell, for all $Z_j$, $b-a = 1$, not $\frac{1}{\sqrt{n}}$, which I think is the only way you can accomplish this step.
Thanks in advance.
For the first point, I do not know why it is done like that, because I think it is fine to say that $$\mathbf{E}[\exp{(s \underset{j}{\max}Z_j)}]=\mathbf{E}[\underset{j}{\max}\exp{(s Z_j)}]\leq \sum_{j=1}^{M}\mathbf{E}[\exp{(s Z_j)}].$$ (One possible explanation for the $2M$: the quantity the lecture actually needs is $\mathbf{E}[\max_j |Z_j|]$, and $\max_j |Z_j| = \max\{Z_1,\dots,Z_M,-Z_1,\dots,-Z_M\}$ is a maximum over $2M$ mean-zero variables, each of which satisfies the same MGF bound.)
For the second point, you need to realize the following equality
$$Z=\hat{R}(h)-R(h)=\frac{1}{n}\sum_{i=1}^{n}\left[1(h(X_{i})\neq Y_i)-\mathbf{E}[1(h(X_i)\neq Y_i)]\right]$$
and now you need to bound $\mathbf{E} [\exp{(sZ)}]$. For this purpose, you can use the bounded differences inequality (see *Concentration Inequalities* by Boucheron, Massart, and Lugosi, Chapter 6) to get that $$\mathbf{E} [\exp{(sZ)}]\leq \exp\left(\frac{s^{2}}{8n}\right).$$
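Equivalently, since the $n$ draws are independent, you can apply Hoeffding's Lemma to each factor separately: after the $\frac{1}{n}$ scaling, each centered summand lies in an interval of length $\frac{1}{n}$, so the relevant $b-a$ is $\frac{1}{n}$ per term rather than $1$:

```latex
\mathbf{E}\left[e^{sZ}\right]
= \prod_{i=1}^{n}\mathbf{E}\left[\exp\!\Big(\tfrac{s}{n}\big(1(h(X_i)\neq Y_i)-\mathbf{E}[1(h(X_i)\neq Y_i)]\big)\Big)\right]
\leq \prod_{i=1}^{n}\exp\!\left(\frac{s^{2}/n^{2}}{8}\right)
= \exp\!\left(\frac{s^{2}}{8n}\right).
```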
This inequality implies the final statement of your post. To finish the proof, you can optimize the right-hand side of the inequality with respect to $s$, and you will find the final result.
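Concretely, the chain above gives $\mathbf{E}[\max_j Z_j] \leq \frac{\log(2M)}{s} + \frac{s}{8n}$ for every $s>0$, and minimizing over $s$:

```latex
\min_{s>0}\left\{\frac{\log(2M)}{s}+\frac{s}{8n}\right\}
\;\text{is attained at}\; s^{*}=\sqrt{8n\log(2M)},
\;\text{giving}\quad
\mathbf{E}\Big[\max_{j}Z_j\Big]\leq 2\sqrt{\frac{\log(2M)}{8n}}=\sqrt{\frac{\log(2M)}{2n}}.
```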
I hope this answers your question and that it is clear.