While reading through the proof of Hoeffding's lemma, I realized I had taken one step for granted: how the proof handles the exponent portion of the inequality after the reparametrization of the bounds.
Note: I will use different variable names from the Wikipedia article; my $a$ is their $\theta$, and my $x$ is their $u$: https://en.wikipedia.org/wiki/Hoeffding%27s_lemma
They took the function $e^{-ax}(1-a+ae^x)$ and rewrote it as $e^{f(x)}$, where $f(x) = -ax + \log(1-a+ae^x)$, then did a Taylor expansion of $f$ around $x = 0$ to get $e^{\frac{1}{2}a(1-a)x^2}$, which translated into the final result after some trivial observations.
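As a quick numerical sanity check (my own illustration, not part of the original proof), the sketch below verifies the bound that the Taylor-with-remainder argument ultimately delivers, $f(x) \le x^2/8$, on a grid of $a$ and $x$ values. The second-order Taylor term is $\frac{1}{2}a(1-a)x^2$, and the Lagrange remainder bound $f''(\xi) \le 1/4$ is what yields $x^2/8$:

```python
import math

def f(a, x):
    """f(x) = -a*x + log(1 - a + a*e^x), the exponent in Hoeffding's lemma."""
    return -a * x + math.log(1 - a + a * math.exp(x))

# Check f(x) <= x^2 / 8 numerically on a grid of a in (0,1) and x in [-3, 3].
# The second derivative f''(u) has the form p*(1-p) with
# p = a*e^u / (1 - a + a*e^u), so it is bounded by 1/4.
for a in [0.1, 0.25, 0.5, 0.75, 0.9]:
    for x in [-3 + 0.25 * k for k in range(25)]:
        assert f(a, x) <= x * x / 8 + 1e-12
print("f(x) <= x^2/8 holds on the grid")
```

This is only a grid check, of course; the proof itself gets the bound for all $x$ from the remainder estimate.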
However, I don't understand what prompted the expansion on $f(x)$, rather than the entire function itself. Can someone elaborate on this?
Part of the motivation for this approach is the more general Chernoff approach to large deviation bounds, where, by Markov's inequality, one has (for $t>0$):
$$\mathbb{P}(X>x) = \mathbb{P}(e^{tX}>e^{tx}) \le \frac{\mathbb{E}[e^{tX}]}{e^{tx}} = \exp(-[tx-C(t)])$$
with $C(t)=\log \mathbb{E}[e^{tX}]$, the cumulant-generating function of $X$.
To proceed, one then calculates the convex conjugate of $C$:
$$r(x) = \sup_{t>0}[tx-C(t)]$$
to deduce that $\mathbb{P}(X>x) \le \exp(-r(x))$ (by taking the $t$ which attains the supremum above).
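To make this concrete (my own example, under the assumption that $X$ is a standard Gaussian, for which $C(t) = t^2/2$), the conjugate is $r(x) = \sup_{t>0}[tx - t^2/2] = x^2/2$, attained at $t = x$. A short numeric sketch:

```python
import math

def C(t):
    # Cumulant-generating function of a standard Gaussian: C(t) = t^2 / 2.
    return t * t / 2

def r(x, t_grid):
    # Convex conjugate r(x) = sup_{t>0} [t*x - C(t)], approximated on a grid.
    return max(t * x - C(t) for t in t_grid)

x = 2.0
t_grid = [0.001 * k for k in range(1, 5001)]  # t in (0, 5]
approx = r(x, t_grid)
exact = x * x / 2                             # supremum attained at t = x
assert abs(approx - exact) < 1e-4
# Chernoff then gives P(X > 2) <= exp(-r(2)) = exp(-2) ~ 0.135,
# a valid (if loose) bound on the true Gaussian tail ~ 0.023.
print(approx, math.exp(-approx))
```

The bound is loose by a polynomial factor, but it captures the correct $e^{-x^2/2}$ decay rate.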
The idea of this standard proof of Hoeffding's inequality is to bound the cumulant-generating function $C(t)$ from above by a quadratic in $t$, which in turn yields an explicit lower bound on the exponent $r(x)$.
The choice to do this by a Taylor expansion is partially motivated by the fact that $C(t)$ is convex (this is true for all cumulant-generating functions), and as such, we have information about its derivatives and curvature.
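The convexity claim is easy to probe numerically. As an illustration (my own, assuming a Bernoulli($p$) variable, whose cumulant-generating function is $C(t) = \log(1-p+pe^t)$, exactly the function appearing in Hoeffding's lemma), the sketch below checks that second differences of $C$ are nonnegative on a grid:

```python
import math

def C(t, p=0.3):
    # Cumulant-generating function of a Bernoulli(p) variable.
    return math.log(1 - p + p * math.exp(t))

# Convexity check via second differences: C(t-h) - 2*C(t) + C(t+h) >= 0
# (up to floating-point noise) for t in [-2, 2].
h = 1e-3
for k in range(-200, 201):
    t = 0.01 * k
    assert C(t - h) - 2 * C(t) + C(t + h) >= -1e-12
print("second differences nonnegative: C is convex on the grid")
```

More generally, $C''(t)$ is the variance of $X$ under the exponentially tilted measure, which is why every cumulant-generating function is convex.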