How to grow the intuition behind this proof?


My intention is to understand this proof completely, step by step, in a lucid manner. I have been trying for a few days to grasp it, unfortunately without success. Can I hope to get a nice intuitive explanation of this proof? Looking forward to such an answer. Thanks in advance.

Paper Link


Theorem 2. Let $\left(x_{i}, y_{i}\right)$ be i.i.d. input-output pairs in $\mathbb{R}^{d} \times[-1,1]$ such that:

  1. The distribution $\mu$ of the covariates $x_{i}$ can be written as $\mu=\sum_{\ell=1}^{k} \alpha_{\ell} \mu_{\ell}$, where each $\mu_{\ell}$ satisfies $c$-isoperimetry and $\alpha_{\ell} \geq 0$, $\sum_{\ell=1}^{k} \alpha_{\ell}=1$.
  2. The expected conditional variance of the output is strictly positive, denoted $\sigma^{2}:=\mathbb{E}_{\mu}[\operatorname{Var}[y \mid x]]>0$.

Then one has:
$$ \begin{aligned} &\mathbb{P}\left(\exists f \in \mathcal{F}: \frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-f\left(x_{i}\right)\right)^{2} \leq \sigma^{2}-\epsilon\right) \\ &\leq 4 k \exp \left(-\frac{n \epsilon^{2}}{8^{3} k}\right)+2 \exp \left(\log (|\mathcal{F}|)-\frac{\epsilon^{2} n d}{9^{4} c L^{2}}\right) \end{aligned} $$

We start with a lemma showing that, to optimize beyond the noise level, one must necessarily correlate with the noise part of the labels. In what follows we denote $g(x)=\mathbb{E}[y \mid x]$ for the target function, and $z_{i}=y_{i}-g\left(x_{i}\right)$ for the noise part of the observed labels (namely $y_{i}$ is the sum of the target function $g\left(x_{i}\right)$ and the noise term $z_{i}$).

Lemma 2.1. One has
$$ \mathbb{P}\left(\exists f \in \mathcal{F}: \frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-f\left(x_{i}\right)\right)^{2} \leq \sigma^{2}-\epsilon\right) \leq 2 \exp \left(-\frac{n \epsilon^{2}}{8^{3}}\right)+\mathbb{P}\left(\exists f \in \mathcal{F}: \frac{1}{n} \sum_{i=1}^{n} f\left(x_{i}\right) z_{i} \geq \frac{\epsilon}{4}\right) $$

Proof. The sequence $\left(z_{i}^{2}\right)$ is i.i.d., with mean $\sigma^{2}$, and such that $\left|z_{i}\right|^{2} \leq 4$. Thus Hoeffding's inequality yields:
$$ \mathbb{P}\left(\frac{1}{n} \sum_{i=1}^{n} z_{i}^{2} \leq \sigma^{2}-\frac{\epsilon}{6}\right) \leq \exp \left(-\frac{n \epsilon^{2}}{8^{3}}\right) \tag{2.1} $$
On the other hand the sequence $\left(z_{i} g\left(x_{i}\right)\right)$ is i.i.d., with mean $0$ (since $\mathbb{E}\left[z_{i} \mid x_{i}\right]=0$), and such that $\left|z_{i} g\left(x_{i}\right)\right| \leq 2$. Thus Hoeffding's inequality yields:
$$ \mathbb{P}\left(\frac{1}{n} \sum_{i=1}^{n} z_{i} g\left(x_{i}\right) \leq-\frac{\epsilon}{6}\right) \leq \exp \left(-\frac{n \epsilon^{2}}{8^{3}}\right) \tag{2.2} $$
Let us write $Z=\frac{1}{\sqrt{n}}\left(z_{1}, \ldots, z_{n}\right)$, $G=\frac{1}{\sqrt{n}}\left(g\left(x_{1}\right), \ldots, g\left(x_{n}\right)\right)$, and $F=\frac{1}{\sqrt{n}}\left(f\left(x_{1}\right), \ldots, f\left(x_{n}\right)\right)$. We claim that if $\|Z\|^{2} \geq \sigma^{2}-\frac{\epsilon}{6}$ and $\langle Z, G\rangle \geq-\frac{\epsilon}{6}$, then for any $f \in \mathcal{F}$ one has
$$ \|G+Z-F\|^{2} \leq \sigma^{2}-\epsilon \Rightarrow\langle F, Z\rangle \geq \frac{\epsilon}{4} $$
This claim together with (2.1) and (2.2) concludes the proof. The claim itself directly follows from:
$$ \sigma^{2}-\epsilon \geq\|G+Z-F\|^{2}=\|Z\|^{2}+2\langle Z, G-F\rangle+\|G-F\|^{2} \geq \sigma^{2}-\frac{\epsilon}{2}-2\langle Z, F\rangle $$
We can now proceed to the proof of Theorem 2.
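To build intuition for the claim at the heart of Lemma 2.1, here is a small numerical sanity check of my own (not from the paper). It constructs random instances satisfying the three premises $\|Z\|^{2} \geq \sigma^{2}-\frac{\epsilon}{6}$, $\langle Z, G\rangle \geq-\frac{\epsilon}{6}$, $\|G+Z-F\|^{2} \leq \sigma^{2}-\epsilon$, then verifies both the norm expansion and the conclusion $\langle F, Z\rangle \geq \frac{\epsilon}{4}$.

```python
import numpy as np

# Sanity check (my own, not from the paper) of the claim in Lemma 2.1:
# if ||Z||^2 >= sigma^2 - eps/6, <Z,G> >= -eps/6 and ||G+Z-F||^2 <= sigma^2 - eps,
# then <F,Z> >= eps/4.
rng = np.random.default_rng(0)
sigma2, eps, d = 1.0, 0.5, 20

for _ in range(1000):
    Z = rng.normal(size=d)
    Z *= np.sqrt(sigma2) / np.linalg.norm(Z)       # ||Z||^2 = sigma^2 >= sigma^2 - eps/6
    G = rng.normal(size=d)
    if Z @ G < 0:                                   # enforce <Z,G> >= 0 >= -eps/6
        G = -G
    W = rng.normal(size=d)
    W *= 0.9 * np.sqrt(sigma2 - eps) / np.linalg.norm(W)
    F = G + Z - W                                   # then ||G+Z-F||^2 = ||W||^2 <= sigma^2 - eps
    # norm expansion used in the proof
    assert np.isclose(np.sum((Z + G - F) ** 2),
                      Z @ Z + 2 * (Z @ (G - F)) + np.sum((G - F) ** 2))
    assert F @ Z >= eps / 4 - 1e-12                 # conclusion of the claim
print("claim verified on 1000 random instances")
```

The point the claim makes quantitative: any $F$ that fits below the noise level must carry a macroscopic component along the noise direction $Z$.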

Proof. First note that without loss of generality we can assume that the range of any function in $\mathcal{F}$ is included in $[-1,1]$ (indeed clipping the values improves both the fit to any $y \in[-1,1]$ and the Lipschitz constant). We also assume w.l.o.g. that all functions in $\mathcal{F}$ are $L$-Lipschitz.
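The clipping remark can be checked numerically (my own check, not from the paper): clipping to $[-1,1]$ is the projection onto a convex set containing every label $y \in[-1,1]$, so it never increases the squared error, and the projection is $1$-Lipschitz, so it preserves any Lipschitz bound.

```python
import random

# Sanity check: clipping a prediction to [-1, 1] only improves the fit to a
# label y in [-1, 1], and the clip map is a 1-Lipschitz contraction.
clip = lambda a: max(-1.0, min(1.0, a))

random.seed(0)
for _ in range(10_000):
    a = random.uniform(-5, 5)      # raw prediction f(x)
    y = random.uniform(-1, 1)      # label
    assert (clip(a) - y) ** 2 <= (a - y) ** 2 + 1e-12
    b = random.uniform(-5, 5)      # 1-Lipschitz: preserves any L-Lipschitz bound
    assert abs(clip(a) - clip(b)) <= abs(a - b) + 1e-12
print("clipping checks passed")
```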

For clarity let us start with the case $k=1$. By the isoperimetry assumption we have that $\sqrt{\frac{d}{c}} \frac{f\left(x_{i}\right)-\mathbb{E}[f]}{L}$ is $1$-subgaussian. Since $\left|z_{i}\right| \leq 2$, we also have that $\sqrt{\frac{d}{c}} \frac{\left(f\left(x_{i}\right)-\mathbb{E}[f]\right) z_{i}}{L}$ is $4$-subgaussian. Moreover, the latter random variable has zero mean since $\mathbb{E}[z \mid x]=0$. Thus by Proposition 4.2 we have:
$$ \mathbb{P}\left(\sqrt{\frac{d}{c n L^{2}}} \sum_{i=1}^{n}\left(f\left(x_{i}\right)-\mathbb{E}[f]\right) z_{i} \geq t\right) \leq 2 \exp \left(-(t / 9)^{2}\right) \tag{2.3} $$
We rewrite the above as:
$$ \mathbb{P}\left(\frac{1}{n} \sum_{i=1}^{n}\left(f\left(x_{i}\right)-\mathbb{E}[f]\right) z_{i} \geq \frac{\epsilon}{8}\right) \leq 2 \exp \left(-\frac{\epsilon^{2} n d}{9^{4} c L^{2}}\right) \tag{2.4} $$
Since we assumed that the range of the functions is in $[-1,1]$ we have $\mathbb{E}[f] \in[-1,1]$ and hence:
$$ \mathbb{P}\left(\exists f \in \mathcal{F}: \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[f] z_{i} \geq \frac{\epsilon}{8}\right) \leq \mathbb{P}\left(\left|\frac{1}{n} \sum_{i=1}^{n} z_{i}\right| \geq \frac{\epsilon}{8}\right) \tag{2.5} $$
(This step is the analog of requiring the labels $y_{i}$ to be well-balanced in the example of perfect interpolation.) By Hoeffding's inequality, the above quantity is smaller than $2 \exp \left(-n \epsilon^{2} / 8^{3}\right)$ (recall that $\left|z_{i}\right| \leq 2$).
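As a sanity check on the constant $8^{3}$ in the last step (my own arithmetic, not from the paper): applying Hoeffding's inequality to $z_{i} \in[-2,2]$, so $\left(b_{i}-a_{i}\right)^{2}=16$, at deviation $\frac{\epsilon}{8}$ gives the exponent $\frac{2 n^{2}(\epsilon / 8)^{2}}{16 n}=\frac{n \epsilon^{2}}{8^{3}}$ exactly:

```python
import math

# Check: Hoeffding exponent for z_i in [-2, 2] at deviation eps/8 equals
# n * eps^2 / 8^3, matching the bound 2 exp(-n eps^2 / 8^3) in the proof.
for n in (10, 100, 1000):
    for eps in (0.1, 0.3, 0.9):
        hoeffding_exponent = 2 * n**2 * (eps / 8) ** 2 / (16 * n)
        assert math.isclose(hoeffding_exponent, n * eps**2 / 8**3)
print("constant 8^3 confirmed")
```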
Thus we obtain with a union bound:
$$ \begin{aligned} \mathbb{P}\left(\exists f \in \mathcal{F}: \frac{1}{n} \sum_{i=1}^{n} f\left(x_{i}\right) z_{i} \geq \frac{\epsilon}{4}\right) & \leq|\mathcal{F}| \cdot \mathbb{P}\left(\frac{1}{n} \sum_{i=1}^{n}\left(f\left(x_{i}\right)-\mathbb{E}[f]\right) z_{i} \geq \frac{\epsilon}{8}\right)+\mathbb{P}\left(\left|\frac{1}{n} \sum_{i=1}^{n} z_{i}\right| \geq \frac{\epsilon}{8}\right) \\ & \leq 2|\mathcal{F}| \cdot \exp \left(-\frac{\epsilon^{2} n d}{9^{4} c L^{2}}\right)+2 \exp \left(-\frac{n \epsilon^{2}}{8^{3}}\right) \end{aligned} $$
Together with Lemma 2.1 this concludes the proof for $k=1$.

We now turn to the case $k>1$. We first sample the mixture component $\ell_{i} \in[k]$ for each data point $i \in[n]$, and we now reason conditioned on these mixture components. Let $S_{\ell} \subset[n]$ be the set of data points sampled from mixture component $\ell \in[k]$, that is $x_{i}$, $i \in S_{\ell}$, is i.i.d. from $\mu_{\ell}$. We now have that $\sqrt{\frac{d}{c}} \frac{f\left(x_{i}\right)-\mathbb{E}^{\mu_{\ell_{i}}}[f]}{L}$ is $1$-subgaussian (note that the centering term now depends on the mixture component). In particular, using the same reasoning as for (2.4), we obtain (crucially note that Proposition 4.2 does not require the random variables to be identically distributed):
$$ \mathbb{P}\left(\frac{1}{n} \sum_{i=1}^{n}\left(f\left(x_{i}\right)-\mathbb{E}^{\mu_{\ell_{i}}}[f]\right) z_{i} \geq \frac{\epsilon}{8}\right) \leq 2 \exp \left(-\frac{\epsilon^{2} n d}{9^{4} c L^{2}}\right) $$
Next we want to appropriately modify (2.5). To do so note that:
$$ \max _{m_{1}, \ldots, m_{k} \in[-1,1]} \sum_{i=1}^{n} m_{\ell_{i}} z_{i}=\sum_{\ell=1}^{k}\left|\sum_{i \in S_{\ell}} z_{i}\right| $$
so that we can rewrite (2.5) as:
$$ \mathbb{P}\left(\exists f \in \mathcal{F}: \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}^{\mu_{\ell_{i}}}[f] z_{i} \geq \frac{\epsilon}{8}\right) \leq \mathbb{P}\left(\frac{1}{n} \sum_{\ell=1}^{k}\left|\sum_{i \in S_{\ell}} z_{i}\right| \geq \frac{\epsilon}{8}\right) $$

Now note that, by Cauchy-Schwarz, $\sum_{\ell=1}^{k} \sqrt{\left|S_{\ell}\right|} \leq \sqrt{n k}$ and thus we have: $$ \mathbb{P}\left(\frac{1}{n} \sum_{\ell=1}^{k}\left|\sum_{i \in S_{\ell}} z_{i}\right| \geq \frac{\epsilon}{8}\right) \leq \mathbb{P}\left(\sum_{\ell=1}^{k}\left|\sum_{i \in S_{\ell}} z_{i}\right| \geq \frac{\epsilon}{8} \sqrt{\frac{n}{k}} \sum_{\ell=1}^{k} \sqrt{\left|S_{\ell}\right|}\right) \leq \sum_{\ell=1}^{k} \mathbb{P}\left(\left|\sum_{i \in S_{\ell}} z_{i}\right| \geq \frac{\epsilon}{8} \sqrt{\frac{n}{k}} \sqrt{\left|S_{\ell}\right|}\right) $$ Finally by Hoeffding's inequality, we have for any $\ell \in[k]$, $\mathbb{P}\left(\left|\sum_{i \in S_{\ell}} z_{i}\right| \geq t \sqrt{\left|S_{\ell}\right|}\right) \leq 2 \exp \left(-\frac{t^{2}}{8}\right)$, and thus the last display is bounded from above by $2 k \exp \left(-\frac{n \epsilon^{2}}{8^{3} k}\right)$. The proof can now be concluded as in the case $k=1$.
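The two arithmetic steps above can be verified numerically (my own checks, not from the paper): the Cauchy-Schwarz bound $\sum_{\ell} \sqrt{\left|S_{\ell}\right|} \leq \sqrt{n k}$ over random partitions, and that plugging $t=\frac{\epsilon}{8} \sqrt{\frac{n}{k}}$ into $2 \exp \left(-\frac{t^{2}}{8}\right)$ yields exactly $2 \exp \left(-\frac{n \epsilon^{2}}{8^{3} k}\right)$.

```python
import math
import random

# Check 1: Cauchy-Schwarz, sum_l sqrt(|S_l|) <= sqrt(n k) for any partition sizes.
random.seed(0)
for _ in range(1000):
    k = random.randint(1, 10)
    sizes = [random.randint(0, 50) for _ in range(k)]   # |S_1|, ..., |S_k|
    n = sum(sizes)
    assert sum(math.sqrt(s) for s in sizes) <= math.sqrt(n * k) + 1e-9

# Check 2: t = (eps/8) sqrt(n/k) gives t^2 / 8 = n eps^2 / (8^3 k).
n, k, eps = 200, 5, 0.3
t = (eps / 8) * math.sqrt(n / k)
assert math.isclose(t**2 / 8, n * eps**2 / (8**3 * k))
print("both checks passed")
```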


Edit 1 :- As per the comment section, I will post my doubts step by step, which should eventually add up to a complete solution, I guess.

DOUBT 1:-

Theorem $12.3$ (Hoeffding inequality) Given independent $\left(X_{1}, \ldots, X_{n}\right)$ with $X_{i} \in\left[a_{i}, b_{i}\right]$ a.s., $$ \operatorname{Pr}\left[\frac{1}{n} \sum_{i}\left(X_{i}-\mathbb{E} X_{i}\right) \geq \epsilon\right] \leq \exp \left(-\frac{2 n^{2} \epsilon^{2}}{\sum_{i}\left(b_{i}-a_{i}\right)^{2}}\right) $$ How does the proof make use of this theorem? Moreover, the theorem bounds an upper deviation ("$\geq$"), while the quoted step in the original proof bounds a lower one ("$\leq$"). I am unable to find the catch.

Proof. The sequence $\left(z_{i}^{2}\right)$ is i.i.d., with mean $\sigma^{2}$, and such that $\left|z_{i}\right|^{2} \leq 4$. Thus Hoeffding's inequality yields: $$ \mathbb{P}\left(\frac{1}{n} \sum_{i=1}^{n} z_{i}^{2} \leq \sigma^{2}-\frac{\epsilon}{6}\right) \leq \exp \left(-\frac{n \epsilon^{2}}{8^{3}}\right) $$
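For reference, my own note (not from the paper) on the usual bridge between the two directions: the lower-tail form follows by applying the stated theorem to the variables $-X_{i}$, here with $X_{i}=z_{i}^{2} \in[0,4]$ and $\mathbb{E} X_{i}=\sigma^{2}$:

```latex
% Lower tail from the upper-deviation form, via X_i -> -X_i:
\Pr\left[\frac{1}{n} \sum_{i}\left(X_{i}-\mathbb{E} X_{i}\right) \leq -\epsilon\right]
  = \Pr\left[\frac{1}{n} \sum_{i}\left((-X_{i})-\mathbb{E}\left[-X_{i}\right]\right) \geq \epsilon\right]
  \leq \exp\left(-\frac{2 n^{2} \epsilon^{2}}{\sum_{i}\left(b_{i}-a_{i}\right)^{2}}\right)
```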