Proving a "quasi-concentration" inequality via covering numbers


Consider a parametric family of real-valued continuous functions $\mathcal F = \{f_\theta\mid \theta\in\Theta\} $ defined on some domain $\mathcal X\subseteq \mathbb R^d$, where each element of $\mathcal F$ is determined by its parameter $\theta \in \Theta \subseteq \mathbb R^p$, and $\Theta$ is a compact set. The family $\mathcal F$ is uniformly equicontinuous in the parameter: for any $\varepsilon>0 $, there exists $\delta\equiv\delta(\varepsilon)>0$ such that $|\theta_1 - \theta_2|\le \delta\implies\|f_{\theta_1}-f_{\theta_2}\|_{\infty} \le \varepsilon $. Furthermore, $\mathcal F$ is uniformly bounded by some constant $F>0$, i.e. $\sup_{\theta\in\Theta} \|f_\theta\|_\infty \le F $. (Here $\|\cdot\|_\infty$ denotes the supremum norm on $\mathcal X$.)
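For concreteness, here is a quick numerical sanity check on a hypothetical toy family (not part of the question) that satisfies both assumptions: $f_\theta(x)=\sin(\langle\theta,x\rangle)$ on $\mathcal X=[0,1]^d$ with $\Theta=[-1,1]^d$ is bounded by $F=1$, and since $|\sin u-\sin v|\le|u-v|$ and $\|x\|\le\sqrt d$ on $\mathcal X$, one may take $\delta(\varepsilon)=\varepsilon/\sqrt d$:

```python
import numpy as np

# Hypothetical toy family: f_theta(x) = sin(<theta, x>) on X = [0,1]^d,
# Theta = [-1,1]^d. Here F = 1, and |f_a(x) - f_b(x)| <= |<a-b, x>|
# <= sqrt(d) * |a - b|, so delta(eps) = eps / sqrt(d).
rng = np.random.default_rng(0)
d = 3
xs = rng.uniform(0.0, 1.0, size=(2000, d))  # random grid as a proxy for the sup over X

def f(theta, x):
    return np.sin(x @ theta)

# Check the modulus-of-continuity claim on random parameter pairs.
for _ in range(100):
    a, b = rng.uniform(-1, 1, d), rng.uniform(-1, 1, d)
    sup_gap = np.max(np.abs(f(a, xs) - f(b, xs)))
    assert sup_gap <= np.sqrt(d) * np.linalg.norm(a - b) + 1e-12
print("equicontinuity bound holds on all sampled pairs")
```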

Now, given a collection $x_1,\ldots,x_n\sim\rho$ of i.i.d. random variables, where $\rho$ is a probability measure on $\mathcal X$, and given $\theta,\vartheta\in\Theta $, I would like an upper bound on the following probability for $t> 0$: $$\mathbb P\left(\left|\frac1n\sum_{i=1}^n f_{\theta}(x_i) -\mathbb E_{x\sim\rho}[f_{\vartheta}(x)]\right|> t\right) \stackrel{?}{\le} a_n \tag1$$ where, hopefully, $a_n$ is of the form $a_n \sim Ce^{-Dn} $ for two positive constants $C$ and $D$ which may depend on $t$ (but not on $n$).

This problem arose as I was trying to estimate the generalization error of a machine learning algorithm, and the main difficulty comes from the fact that I do not assume $\theta = \vartheta$. Indeed, if we had $\theta = \vartheta$, we could apply a covering number argument as follows: denote by $\Theta_{\delta(t/3)}$ a $\delta(t/3)$-cover of $\Theta$ with minimal cardinality, and for a given $\theta\in\Theta$ pick $\bar\theta \in \Theta_{\delta(t/3)}$ with $|\theta-\bar\theta|\le\delta(t/3)$, so that $\|f_{\theta}-f_{\bar\theta}\|_\infty\le t/3$. Then $$\begin{align*}\left|\frac1n\sum_{i=1}^n f_{\theta}(x_i) -\mathbb E_{x\sim\rho}[f_{\theta}(x)]\right| &\le\left|\frac1n\sum_{i=1}^n f_{\theta}(x_i) -\frac1n\sum_{i=1}^n f_{\bar\theta}(x_i)\right| + \left|\frac1n\sum_{i=1}^n f_{\bar\theta}(x_i) -\mathbb E_{x\sim\rho}[f_{\bar\theta}(x)]\right| \\ &+ \left|\mathbb E_{x\sim\rho}[f_{\bar\theta}(x)]-\mathbb E_{x\sim\rho}[f_{\theta}(x)]\right|\\ &\le \|f_{\theta}-f_{\bar\theta}\|_\infty + \left|\frac1n\sum_{i=1}^n f_{\bar\theta}(x_i) -\mathbb E_{x\sim\rho}[f_{\bar\theta}(x)]\right| + \|f_{\theta}-f_{\bar\theta}\|_\infty\\ &\le \frac{2t}{3} + \left|\frac1n\sum_{i=1}^n f_{\bar\theta}(x_i) -\mathbb E_{x\sim\rho}[f_{\bar\theta}(x)]\right|\end{align*}$$ Hence it follows that $$\begin{align*}\mathbb P\left(\left|\frac1n\sum_{i=1}^n f_{\theta}(x_i) -\mathbb E_{x\sim\rho}[f_{\theta}(x)]\right|> t\right) &\le \mathbb P\left(\sup_{\theta\in\Theta}\left|\frac1n\sum_{i=1}^n f_{\theta}(x_i) -\mathbb E_{x\sim\rho}[f_{\theta}(x)]\right|> t\right)\\ &\le \mathbb P\left(\sup_{\bar\theta\in\Theta_{\delta(t/3)}}\left|\frac1n\sum_{i=1}^n f_{\bar\theta}(x_i) -\mathbb E_{x\sim\rho}[f_{\bar\theta}(x)]\right|> \frac t3\right)\\ &\le 2|\Theta_{\delta(t/3)}|\exp\left({-\frac{nt^2}{18F^2}}\right) \tag{1'}\end{align*} $$ where the last inequality follows from the union bound and Hoeffding's inequality (each $f_{\bar\theta}(x_i)$ takes values in $[-F,F]$, an interval of length $2F$). Note that by definition, $|\Theta_{\delta(t/3)}| $ is the $\delta(t/3)$-covering number of $\Theta$ with respect to the Euclidean norm.
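As a sanity check on the single-$\bar\theta$ Hoeffding step, here is a small Monte Carlo sketch. The toy family $f_\theta(x)=\sin(\theta x)$ with $x\sim\mathrm{Uniform}[0,1]$ is an assumption for illustration only; then $F=1$ and $\mathbb E[f_\theta(x)]=(1-\cos\theta)/\theta$, and the empirical tail probability is compared to the Hoeffding bound with range $2F$:

```python
import numpy as np

# Monte Carlo check of the single-parameter Hoeffding step used in (1').
# Toy family (an assumption for illustration): f_theta(x) = sin(theta * x),
# x ~ Uniform[0,1], so F = 1 and E[f_theta(x)] = (1 - cos(theta)) / theta.
rng = np.random.default_rng(1)
theta, F, n, t, trials = 2.0, 1.0, 200, 0.1, 5000
mean_true = (1.0 - np.cos(theta)) / theta

# Deviations of the empirical mean from the true mean, over many repetitions.
devs = np.abs(np.sin(theta * rng.uniform(0, 1, (trials, n))).mean(axis=1) - mean_true)
emp_tail = np.mean(devs > t)
hoeffding = 2.0 * np.exp(-2.0 * n * t**2 / (2.0 * F) ** 2)  # range of f is 2F

print(f"empirical P(dev > {t}) = {emp_tail:.4f}  vs  Hoeffding bound {hoeffding:.4f}")
assert emp_tail <= hoeffding
```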
For $(1)$ however, we may have $\theta\ne\vartheta$, in which case the above argument falls apart.

Hence my question: is it possible to bound the probability $\mathbf{(1)}$ by modifying the argument used to obtain the bound $\mathbf{(1')}$? If not, is there any way to obtain a non-vacuous bound on that probability? As a commenter has made clear, additional assumptions (on $\rho$, on $\mathcal F$, or otherwise) will be necessary in order to reach the desired conclusion.

Thanks in advance.


Here is a naïve solution I have come up with: observe that for any $\theta,\vartheta\in \Theta$, we have $\|f_\theta-f_\vartheta\|_\infty\le \|f_\theta\|_\infty + \|f_\vartheta\|_\infty \le 2F $, and in particular $|\mathbb E_{x\sim\rho}[f_\theta(x) - f_\vartheta(x)]| \le 2F$. Hence if we assume that $\mathbf{2F < t}$, we can write $$\begin{align*}\mathbb P\left(\left|\frac1n\sum_{i=1}^n f_{\theta}(x_i) -\mathbb E_{x\sim\rho}[f_{\vartheta}(x)]\right|> t\right) &= \mathbb P\left(\left|\frac1n\sum_{i=1}^n f_{\theta}(x_i) -\mathbb E_{x\sim\rho}[f_{\theta}(x)] + \mathbb E_{x\sim\rho}[f_\theta(x) - f_\vartheta(x)]\right|> t\right)\\ &\le\mathbb P\left(\left|\frac1n\sum_{i=1}^n f_{\theta}(x_i) -\mathbb E_{x\sim\rho}[f_{\theta}(x)]\right|> t - 2F\right)\\ &\le2\exp\left({-\frac{2n(t-2F)^2}{4F^2}}\right)\end{align*} $$ where the first step adds and subtracts $\mathbb E_{x\sim\rho}[f_\theta(x)]$, the second uses the triangle inequality together with $|\mathbb E_{x\sim\rho}[f_\theta(x)-f_\vartheta(x)]|\le 2F$, and the last is Hoeffding's inequality. I don't like this too much, though, because in my application $t$ is typically small while $F$ is typically large. Does anyone have a better idea?
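To see why small $t$ is genuinely problematic, here is a hedged numerical illustration (same toy family $f_\theta(x)=\sin(\theta x)$, $x\sim\mathrm{Uniform}[0,1]$, chosen purely for illustration): by the law of large numbers the empirical average concentrates around $\mathbb E[f_\theta]$, so the deviation in $(1)$ concentrates around the bias $b=|\mathbb E[f_\theta]-\mathbb E[f_\vartheta]|$, and for $t<b$ the probability tends to $1$ rather than $0$:

```python
import numpy as np

# Illustration on a toy family (an assumption, not from the question):
# f_theta(x) = sin(theta * x), x ~ Uniform[0,1]. With theta != vartheta,
# the empirical mean concentrates around E[f_theta], so the deviation in (1)
# concentrates around the bias b = |E[f_theta] - E[f_vartheta]|; for t < b
# the probability approaches 1 as n grows, so no a_n -> 0 is possible there.
rng = np.random.default_rng(2)
theta, vartheta, trials = 2.0, 0.5, 2000

def mean(th):
    return (1.0 - np.cos(th)) / th  # E[sin(th * x)] for x ~ Uniform[0,1]

b = abs(mean(theta) - mean(vartheta))  # bias between the two expectations

for n in (50, 400):
    devs = np.abs(np.sin(theta * rng.uniform(0, 1, (trials, n))).mean(axis=1)
                  - mean(vartheta))
    below = np.mean(devs > 0.5 * b)  # t = b/2 is below the bias
    print(f"n={n}: P(dev > b/2) = {below:.3f}")  # stays near 1 as n grows
```

Conversely, when $t>b$ the same argument as above with $2F$ replaced by $b$ would give the sharper bound $2\exp(-n(t-b)^2/(2F^2))$, though $b$ is of course unknown in general.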