Does Hoeffding's inequality give a tighter dependence on sample size than the Central Limit Theorem?


From the Central Limit Theorem, we know that $\mu - \bar{x}$ has variance $\frac{\sigma^2}{n}$, where $\sigma^2$ is the population variance. Via Chebyshev's inequality, we can roughly interpret this as saying that $P(|\mu - \bar{x}|\ge\epsilon)$ is inversely proportional to the sample size $n$. On the other hand, the version of Hoeffding's inequality used in PAC bound derivations states that $P[|\mu - \bar{x}| \geq \varepsilon] \leq 2e^{-2n\varepsilon^2}$. Here the probability of $\bar{x}$ deviating from $\mu$ decays exponentially in the sample size $n$. So apparently Hoeffding's inequality gives a tighter bound than the CLT does. Is this because more assumptions are made in the derivation of Hoeffding's inequality?
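
Here is a quick numerical comparison of the two rates, just for illustration (it assumes a variable bounded in $[0,1]$, so that $\sigma^2 \le 1/4$, and takes $\epsilon = 0.1$; both choices are arbitrary):

```python
import numpy as np

# Chebyshev-style bound sigma^2 / (n eps^2), polynomial in n, versus
# Hoeffding's 2 exp(-2 n eps^2), exponential in n.
eps = 0.1
for n in [10, 100, 1000, 10000]:
    chebyshev = 0.25 / (n * eps**2)          # sigma^2 <= 1/4 for X in [0, 1]
    hoeffding = 2 * np.exp(-2 * n * eps**2)
    print(f"n={n:6d}  Chebyshev {chebyshev:9.2e}   Hoeffding {hoeffding:9.2e}")
```

Interestingly, for small $n$ the Chebyshev-style bound is actually smaller; the exponential decay only wins once $n$ is moderately large.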

Best answer:

To understand the conclusion of the central limit theorem, it is necessary to understand weak convergence. We say a sequence of random variables $X_n$ converges weakly to $X$, denoted $X_n\Rightarrow X$, if $P(X_n\leq t) \to P(X\leq t)$ as $n\to\infty$ for every $t$ at which $t\mapsto P(X\leq t)$ is continuous. Two equivalent definitions of weak convergence are:

  1. $E[f(X_n)]\to E[f(X)]$ as $n\to\infty$ for all continuous and bounded $f$.
  2. $P(X_n \in B) \to P(X\in B)$ as $n\to\infty$ for all continuity sets $B$, that is, sets with $P(X\in \partial B) = 0$, where $\partial B = \bar B \setminus \overset{\circ}{B}$ denotes the boundary.
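
To make the definition concrete, here is a minimal Python sketch (my own illustration; the uniform distribution, sample sizes, and evaluation points are arbitrary choices) that estimates $P(X_n \le t)$ for standardized sums of uniform random variables and compares it to the normal CDF, which is what the central limit theorem below predicts as the limit:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def sample_X_n(n, reps=20_000):
    """Standardized sum of n Uniform(0,1) draws; the CLT gives X_n => N(0,1)."""
    u = rng.uniform(size=(reps, n))
    # Uniform(0,1) has mean 1/2 and variance 1/12.
    return (u.sum(axis=1) - n / 2) / np.sqrt(n / 12)

for n in [1, 5, 50, 500]:
    x = sample_X_n(n)
    # Compare P(X_n <= t) with Phi(t) at a few continuity points t.
    for t in [-1.0, 0.0, 1.0]:
        print(f"n={n:4d}  t={t:+.1f}  P(X_n<=t) ~ {(x <= t).mean():.4f}   Phi(t) = {norm.cdf(t):.4f}")
```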

It turns out that if $t\mapsto P(X\leq t)$ is continuous, then weak convergence can be upgraded to $$\sup_{t\in \mathbb R}|P(X\leq t)-P(X_n\leq t)| \underset{n\to\infty}{\longrightarrow}0$$ This brings us to the (classical) central limit theorem, which asserts that for any sequence of i.i.d. random variables with mean $\mu$ and variance $\sigma^2<\infty$, $$\sqrt n(\hat \mu - \mu)\Rightarrow \mathcal N(0,\sigma^2)$$

At this point, it is tempting to use the normal approximation to do computations, which leads to the question of when this is justified. The obvious answer is that anything which relates to weak convergence is justified, though still only at a very heuristic level. Indeed, it would seem to be OK to approximate $$P(\sqrt n|\hat \mu-\mu| \leq \epsilon) \approx P(|Z|\leq \epsilon),\quad Z\sim\mathcal N(0,\sigma^2)$$

What is not justified is using this to make quantitative statements about the rate of convergence of the sample mean $\hat \mu$ to $\mu$. Indeed, Gaussians have extremely tight concentration around their mean: for $Z\sim\mathcal N(0,1)$, $$P(Z\geq t) \leq \frac{1}{t}\frac{1}{\sqrt{2\pi}}e^{-t^2/2},\quad t>0$$ There is no chance at all for a random variable $X$ with only finite variance to achieve such concentration; indeed, such tight concentration would imply that $E[e^{cX^2}]<\infty$ for some $c>0$. The point is that one cannot and should not make quantitative statements about concentration using the approximation provided by the central limit theorem.
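
To see numerically how far a merely finite-variance distribution can be from Gaussian concentration, here is a minimal Python sketch (the Pareto shape parameter $2.5$, which gives finite variance but polynomial tails, is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pareto (Lomax) with shape 2.5: finite variance, but P(X >= t) decays
# only polynomially, ~ t^{-2.5}, versus the Gaussian tail e^{-t^2/2}.
x = rng.pareto(2.5, size=5_000_000)
x = (x - x.mean()) / x.std()  # standardize to mean 0, variance 1

for t in [2, 4, 8]:
    emp = (x >= t).mean()
    gauss_tail = np.exp(-t**2 / 2) / (t * np.sqrt(2 * np.pi))
    print(f"t={t}:  P(X>=t) ~ {emp:.2e}   Gaussian tail bound {gauss_tail:.2e}")
```

At $t=8$ the standardized Pareto tail is still of order $10^{-3}$, while the Gaussian bound is astronomically smaller; this is the gap the paragraph above is warning about.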

In fact, even the use of such an approximation is not formally justified by the classical central limit theorem, though it is standard practice in statistics. It can be made rigorous using the Berry-Esseen central limit theorem, which states that if $\rho = E|X_1-\mu|^3 < \infty$, then $$\sup_{t\in\mathbb R}\left|P\left(\frac{\sqrt n(\hat\mu - \mu)}{\sigma}\leq t\right) - \Phi(t)\right| \leq \frac{C\rho}{\sigma^3\sqrt n}$$ for a universal constant $C$; but then you also need a finite third moment.

More generally, one can use the central limit theorem to approximate continuous functionals. (Again, this is heuristic and not at all precise, but it is standard practice.) It turns out that many quantities of interest are functions of the distribution function $F$, for example $$F \mapsto \int g(x)\,dF(x),\quad F\mapsto \|F-F_0\|_\infty$$ Using the central limit theorem to approximate functionals therefore boils down to whether $$(X,d)\ni F \mapsto \gamma(F) \in \mathbb R$$ is continuous, where $X$ is the space of distribution functions and $d(F,G) = \|F-G\|_\infty$. It is well known that the moment functional $F\mapsto \int x^p\,dF(x)$, $p \in \mathbb N$, is not continuous. (It is easy to construct a counterexample if you know about dominated convergence; there are special conditions under which it is continuous, but not in general.)

In other words, the central limit theorem does not assert that $$\mathrm{Var}(\hat \mu) = \frac{\sigma^2}{n},\quad \hat \mu \equiv \hat \mu(n),$$ nor is the approximation $$\mathrm{Var}(\hat \mu) \approx \frac{\sigma^2}{n}$$ justified, not even heuristically. Rather, this conclusion follows from the direct computation $$\mathrm{Var}\left(\frac{S_n}{n}\right) = \frac{1}{n^2}\mathrm{Var}(S_n) = \frac{1}{n^2} \sum_{i=1}^n \mathrm{Var}(X_i) = \frac{\sigma^2}{n}$$
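
A quick numerical check of this exact computation, as a sketch (the Exponential(1) distribution, for which $\sigma^2 = 1$, is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Exponential(1) has sigma^2 = 1, so Var(sample mean of n draws) = 1/n exactly,
# by the direct computation above; no appeal to the CLT is needed.
for n in [10, 100, 1000]:
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    print(f"n={n:5d}  empirical Var = {means.var():.5f}   sigma^2/n = {1/n:.5f}")
```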

As for Hoeffding's inequality, it is necessary to introduce the notion of subgaussian random variables. Call $X$ $\sigma^2$-subgaussian if $$E[e^{\lambda (X-E[X])}]\leq e^{\lambda^2\sigma^2/2},\quad \forall\,\lambda \in \mathbb R$$ For convenience, and really with no loss of generality, I'll take $X$ to be centered from now on. If $X$ is $\sigma^2$-subgaussian and $t>0$, then by Markov's inequality applied to $e^{\lambda X}$, $$P(X\geq t) \leq \inf_{\lambda>0}e^{-\lambda t}E[e^{\lambda X}] \leq \inf_{\lambda>0}e^{-\lambda t+\lambda^2\sigma^2/2} = e^{-t^2/(2\sigma^2)},$$ the infimum being attained at $\lambda = t/\sigma^2$. Applying the same bound to $-X$ together with a union bound yields $$P(|X|\geq t) \leq 2e^{-t^2/(2\sigma^2)},\quad t>0$$ It is easy to see that the sum of independent subgaussians is also subgaussian, with parameter given by the sum of the respective parameters. In particular, if $X_1,\dots,X_n$ are i.i.d. $\sigma^2$-subgaussian, then $$P(|S_n|\geq t)\leq 2e^{-t^2/(2n\sigma^2)} \implies P(|n^{-1}S_n|\geq t)\leq 2e^{-nt^2/(2\sigma^2)}$$

This of course raises the question of which random variables are subgaussian. It is a standard exercise (e.g., see Wainwright, High-Dimensional Statistics, Chapter 2) to show that if $X\in [a,b]$ almost surely, then $X$ is $\left(\frac{b-a}{2}\right)^2$-subgaussian. This yields Hoeffding's inequality: if $X_1,\dots,X_n$ are independent with $X_i\in[a_i,b_i]$ almost surely, then $$P\left(\left \lvert \frac{1}{n}\sum_{i=1}^n (X_i - E[X_i])\right\rvert \geq t\right) \leq 2\exp\left(\frac{-2n^2t^2}{\sum_{i=1}^n (b_i-a_i)^2} \right),\quad t>0$$

Finally, if $X$ is $\sigma^2$-subgaussian, then it is easy to show that $\mathrm{Var}(X) \leq \sigma^2$. Note that this method of obtaining concentration is "blind" to the details of the underlying distribution, and it is of course possible to obtain even tighter concentration results given more concrete details about the distribution. Indeed, if you look at how Hoeffding's inequality is derived, it follows from showing that an $X\in [a,b]$ almost surely is $(b-a)^2/4$-subgaussian. Since in fact $\mathrm{Var}(X)\leq (b-a)^2/4$, in some heuristic sense Hoeffding's inequality, because it deals with the most general case, necessarily assumes the worst possible concentration of measure for the random variables in question.
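
As a minimal sanity check of the inequality, here is a Python sketch (Bernoulli$(1/2)$ draws, $n=200$, and $t=0.1$ are arbitrary choices of bounded variable and parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

# X_i ~ Bernoulli(1/2) takes values in [0, 1], so Hoeffding gives
# P(|mean - 1/2| >= t) <= 2 exp(-2 n t^2).
n, t, reps = 200, 0.1, 50_000
means = rng.integers(0, 2, size=(reps, n)).mean(axis=1)

empirical = (np.abs(means - 0.5) >= t).mean()
bound = 2 * np.exp(-2 * n * t**2)
print(f"empirical P(|mean - 1/2| >= {t}) ~ {empirical:.5f}")
print(f"Hoeffding bound                 = {bound:.5f}")
# For Bernoulli(1/2) the worst-case proxy (b-a)^2/4 = 1/4 equals the true
# variance, so the exponential rate is right, though the constant is loose.
```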

(There is an alternative standard argument for Hoeffding's inequality, but I prefer the presentation here as it is more "probabilistic," whereas the other is strictly analytic.)