Consider the binary cross entropy loss of the posterior $\eta(x) = \mathbb{P}(Y=1 | X=x)$:
$$\mathcal{L}_n(\eta) = \frac{1}{n} \sum_{i=1}^n \left[ Y_i\log(\eta(X_i)) + (1-Y_i)\log(1-\eta(X_i)) \right]$$
Assume that $\eta$ has finite bracketing entropy. Specifically, assume that for every $\eta$ there exist $\eta_L \leq \eta \leq \eta_U$ with $\eta_L, \eta_U \in \mathcal{F}_\delta$, where $\mathcal{F}_\delta$ is a finite class of functions and $\eta_L, \eta_U$ are close in the sense that $\mathbb{E}|\eta_U - \eta_L| \leq \delta$.
Show the following upper bound on the deviation of the empirical likelihood from its expectation, for all $\varepsilon < \varepsilon_0$, where $\varepsilon_0$ is a positive constant that may depend on the $\eta$ being considered:
$$\mathbb{P}(\mathcal{L}_n - \mathbb{E}[\mathcal{L}_n] > \varepsilon) \leq e^{-n\varepsilon^2/16}$$
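As a numerical sanity check of the claimed rate, here is a quick simulation under an *extra* assumption not in the problem statement: a hypothetical posterior clipped away from $\{0,1\}$ so that the log terms stay bounded. The particular $\eta$, clipping level, and sample sizes are illustrative choices, and $\mathbb{E}[\mathcal{L}_n]$ is approximated by the Monte Carlo mean over trials:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior: a logistic curve clipped to [0.1, 0.9] so that
# log(eta) and log(1 - eta) are bounded (an extra assumption, not given).
def eta(x):
    return np.clip(1.0 / (1.0 + np.exp(-x)), 0.1, 0.9)

def L_n(n):
    # One draw of the empirical log-likelihood
    # (1/n) * sum_i [Y_i log eta(X_i) + (1 - Y_i) log(1 - eta(X_i))]
    x = rng.normal(size=n)
    p = eta(x)
    y = rng.random(n) < p
    return np.mean(np.where(y, np.log(p), np.log(1 - p)))

n, eps, trials = 400, 0.1, 2000
samples = np.array([L_n(n) for _ in range(trials)])
mean_L = samples.mean()                       # Monte Carlo proxy for E[L_n]
emp_prob = np.mean(samples - mean_L > eps)    # empirical deviation frequency
bound = np.exp(-n * eps**2 / 16)              # claimed tail bound
print(emp_prob, bound)
```

With these parameters the empirical deviation frequency sits far below the bound, consistent with (but of course not proving) the inequality.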
I expect I should be able to apply a Chernoff bound here, but I am having trouble evaluating or bounding the moment generating function of $\mathcal{L}_n$:
$$\mathbb{P}(\mathcal{L}_n - \mathbb{E}[\mathcal{L}_n] > \varepsilon) \leq \inf_{t>0} \frac{\mathbb{E}\exp\big(t(\mathcal{L}_n - \mathbb{E}[\mathcal{L}_n])\big)}{\exp(t\varepsilon)}$$
In this case I don't know anything about $\eta$ other than its range. In particular, $\eta(x)$ can be $0$ for some values of $x$, in which case $\log(\eta(x))$ is unbounded. How do I deal with this? If it's not possible to show such a bound, what additional assumptions do I need to make so that such a bound is possible?
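For what it's worth, one way to sidestep the unboundedness entirely is to add a clipping hypothesis (this is an extra assumption I'm introducing, not part of the original problem), under which Hoeffding's inequality applies directly:

```latex
% Extra hypothesis: c <= eta(x) <= 1 - c for all x. Each summand is then bounded:
Z_i = Y_i \log \eta(X_i) + (1 - Y_i)\log(1 - \eta(X_i)) \in [\log c,\ \log(1 - c)],
% so Hoeffding's inequality for the average \mathcal{L}_n = \frac{1}{n}\sum_i Z_i gives
\mathbb{P}\big(\mathcal{L}_n - \mathbb{E}[\mathcal{L}_n] > \varepsilon\big)
  \le \exp\!\left( -\frac{2 n \varepsilon^2}{\log^2\!\frac{1-c}{c}} \right),
% which is at least as strong as e^{-n\varepsilon^2/16} once \log\frac{1-c}{c} \le \sqrt{32}.
```

This gives the target constant only for $c$ not too small, but it makes explicit which quantity (the range of the log terms) controls the rate.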
EDIT:
I've tried to apply a symmetrization argument, but I am still missing a bound on the deviation of each term in the summation.
As described in the question, we will use a Chernoff bound:
$$\mathbb{P}(\mathcal{L}_n - \mathbb{E}[\mathcal{L}_n] > \varepsilon) \leq \inf_{t>0} \frac{\mathbb{E}\exp\big(t(\mathcal{L}_n - \mathbb{E}[\mathcal{L}_n])\big)}{\exp(t\varepsilon)}$$
Let's focus on the numerator. By Jensen's inequality:
$$\mathbb{E}\exp\big(t(\mathcal{L}_n - \mathbb{E}[\mathcal{L}_n])\big) \leq \mathbb{E}\exp\big(t(\mathcal{L}_n - \mathcal{L}_n')\big)$$
where $\mathcal{L}_n'$ is computed from an independent sample identically distributed to the one defining $\mathcal{L}_n$. Write $Z_i, Z_i'$ for the individual terms of the sums forming $\mathcal{L}_n, \mathcal{L}_n'$ respectively. Each difference $Z_i - Z_i'$ is symmetric about zero, so it has the same distribution as $\sigma_i(Z_i - Z_i')$, where the $\sigma_i$ are i.i.d. Rademacher random variables. Taking the expectation conditional on the data and applying the fact that Rademacher variables are subgaussian:
$$\leq \prod_{i=1}^n \mathbb{E}\exp\left(\frac{t^2(Z_i - Z_i')^2}{2n^2}\right)$$
All that remains is to upper bound $|Z_i - Z_i'|$ and then optimize over $t$ to get the desired bound. However, it's not clear to me that this difference can be bounded.
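If one is willing to assume an almost-sure bound $|Z_i - Z_i'| \leq B$ (for instance $B = \log\frac{1-c}{c}$ under the additional hypothesis $c \leq \eta \leq 1-c$, which is not given in the problem), the symmetrization argument closes as follows:

```latex
% Suppose |Z_i - Z_i'| <= B almost surely. Since Rademacher variables are
% 1-subgaussian, the product above is bounded deterministically:
\mathbb{E}\exp\big(t(\mathcal{L}_n - \mathcal{L}_n')\big)
  \le \prod_{i=1}^n \exp\!\left(\frac{t^2 B^2}{2 n^2}\right)
  = \exp\!\left(\frac{t^2 B^2}{2 n}\right),
% and minimizing \exp\!\left(\frac{t^2 B^2}{2n} - t\varepsilon\right) over t > 0
% (the minimizer is t = n\varepsilon / B^2) yields
\mathbb{P}\big(\mathcal{L}_n - \mathbb{E}[\mathcal{L}_n] > \varepsilon\big)
  \le \exp\!\left(-\frac{n \varepsilon^2}{2 B^2}\right).
```

This recovers the stated constant whenever $2B^2 \leq 16$, i.e. $B \leq 2\sqrt{2}$; without some such boundedness assumption the optimization over $t$ has nothing to work with.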
EDIT 2:
It should also be possible to apply Theorem 2 from https://terrytao.wordpress.com/2010/01/03/254a-notes-1-concentration-of-measure/ by showing that each term in the likelihood has bounded variance.
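On the bounded-variance route: the conditional second moment of each term is $\mathbb{E}[Z_i^2 \mid X_i] = \eta \log^2\eta + (1-\eta)\log^2(1-\eta)$, and this is uniformly bounded over $\eta \in [0,1]$ because $p \log^2 p \to 0$ as $p \to 0$ (the function peaks at $p = e^{-2}$). A quick numerical check of that claim:

```python
import numpy as np

# E[Z_i^2 | X_i = x] when Y | X ~ Bernoulli(eta(x)) and
# Z_i = Y_i log(eta) + (1 - Y_i) log(1 - eta):
def second_moment(p):
    return p * np.log(p) ** 2 + (1 - p) * np.log(1 - p) ** 2

# Grid search over the open interval (0, 1); the endpoint limits are 0.
grid = np.linspace(1e-9, 1 - 1e-9, 1_000_001)
sup = second_moment(grid).max()
print(sup)  # finite, so each term has uniformly bounded variance
```

So the variance condition needed for a Bernstein-type bound holds with no extra assumption on $\eta$, even though the terms themselves (and hence the MGF) can be unbounded.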
EDIT 3:
I'm pretty sure we need additional conditions on the class of bracketing functions (either a bounded second moment or almost-sure boundedness), but I am not sure how to show that those conditions are required. For comparison, the weak law of large numbers uses a finite-variance condition and arrives at a similar subgaussian convergence rate (the same rate can also be reached under an absolute-integrability condition via the truncation method). In this case, however, the convergence is only subexponential, since the bound is only required to hold for $\varepsilon < \varepsilon_0$.
This question is from problem 15.4 in [1].
References
[1] Devroye, Luc, László Györfi, and Gábor Lugosi. *A Probabilistic Theory of Pattern Recognition*. Vol. 31. Springer Science & Business Media, 2013.