Concentration of the $\ell_2$ error of the empirical distribution


Let $X$ be a random variable that takes values only in the set $\{1,2, \dots, m\}$ such that $\Pr(X = i) = p_i$ for all $i = 1,2, \dots, m$. Let $S = \{X_1, X_2, \dots, X_n\}$ be a set of $n$ i.i.d. realisations of $X$. We can then construct the empirical distribution $\{\hat{p}_i\}_{i = 1}^m$ using the samples from $S$, where $\hat{p}_i = \frac{1}{n} |\{ j : X_j = i\}|$.
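As a concrete illustration of the setup (a minimal sketch assuming NumPy; all names and parameter values here are illustrative, not from the question), the empirical distribution can be computed as:

```python
import numpy as np

# Hypothetical example: m categories, n i.i.d. draws of X.
rng = np.random.default_rng(0)
m, n = 5, 1000
p = np.full(m, 1.0 / m)               # true distribution (uniform, for illustration)
samples = rng.choice(m, size=n, p=p)  # X_1, ..., X_n, coded as 0, ..., m-1

# p_hat_i = |{j : X_j = i}| / n
p_hat = np.bincount(samples, minlength=m) / n
```

Here `p_hat` is the vector $(\hat{p}_1, \dots, \hat{p}_m)$, which sums to 1 by construction.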

I am interested in concentration inequalities for the random variable $Z := Y - \mathbb{E}[Y]$, where $Y = \| \hat{p} - p \|^2_2$ is the squared $\ell_2$ norm of the difference between the empirical and the true distribution. I have found results in the literature on the concentration of $\| \hat{p} - p\|_1$, $D(\hat{p} \| p)$ and $\sup_{i} |\hat{p}_i - p_i |$, where $D(p\|q)$ is the KL divergence between $p$ and $q$. However, quite surprisingly, I haven't come across any result on the concentration of the $\ell_2$ norm. While it is possible to directly apply the results for the other norms, the resulting bounds are not tight enough for my purposes.

Any leads or references will be appreciated. Thanks!

1 Answer

Note that if $F_n$ denotes the empirical c.d.f., $F$ the true c.d.f., and $i^{\star} = \operatorname{argmax}_{1 \leq i \leq m} |\hat{p}_i - p_i|$, then:

1. $\|\hat{p}_n - p\|_2 \leq \sqrt{m}\, \|\hat{p}_n - p\|_{\infty}$, since for any vector in $\mathbb{R}^m$ the $\ell_2$ norm is at most $\sqrt{m}$ times the $\ell_\infty$ norm;
2. $\|\hat{p}_n - p\|_{\infty} = |\hat{p}_{i^{\star}} - p_{i^{\star}}| = \big|\big(F_n(i^{\star}) - F_n(i^{\star}-1)\big) - \big(F(i^{\star}) - F(i^{\star}-1)\big)\big| \leq 2 \|F_n - F\|_{\infty}$;
3. by the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality, $\mathbb{P}\big(\|F_n - F\|_{\infty} > s\big) \leq 2 e^{-2 n s^2}$ for all $s > 0$.

By these three points (in order), we have, for all $t > 0$, \begin{align*} \mathbb{P}\Big(\|\hat{p}_n - p\|_2 > t\Big) \leq \mathbb{P}\Big(\|\hat{p}_n - p\|_{\infty} > \frac{t}{\sqrt{m}}\Big) \leq \mathbb{P}\Big(\|F_n - F\|_{\infty} > \frac{t}{2 \sqrt{m}}\Big) \leq 2 e^{- \frac{n t^2}{2 m}}. \end{align*}
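This tail bound is easy to sanity-check numerically. The following sketch (assuming NumPy; the particular values of $m$, $n$, and $t$ are illustrative) estimates $\mathbb{P}(\|\hat{p}_n - p\|_2 > t)$ by Monte Carlo and compares it against $2 e^{-n t^2 / (2m)}$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, t = 4, 500, 0.15          # illustrative parameters
p = np.full(m, 1.0 / m)         # true distribution (uniform, for illustration)
trials = 2000

exceed = 0
for _ in range(trials):
    samples = rng.choice(m, size=n, p=p)
    p_hat = np.bincount(samples, minlength=m) / n
    if np.linalg.norm(p_hat - p) > t:   # ||p_hat - p||_2 > t ?
        exceed += 1

empirical_prob = exceed / trials
bound = 2 * np.exp(-n * t**2 / (2 * m))  # the DKW-based bound above
```

With these parameters the bound is about $0.49$, while the empirical exceedance probability is essentially zero, consistent with the bound being valid but loose (the $\ell_2 \to \ell_\infty$ comparison costs a factor $\sqrt{m}$).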

Furthermore, if you want concentration about the mean, note that $x\mapsto \sqrt{x}$ is concave, so Jensen's inequality yields \begin{align*} \mathbb{E}\Big[\|\hat{p}_n - p\|_2\Big] &\leq \sqrt{\mathbb{E}\Big[\sum_{i=1}^m |\hat{p}_i - p_i|^2\Big]} \\ &= \sqrt{\sum_{i=1}^m \frac{1}{n^2} \mathbb{E}\Big[\Big|\sum_{j=1}^n \mathbf{1}_{\{X_j = i\}} - n p_i\Big|^2\Big]} \\ &\leq \sqrt{m \cdot \frac{1}{n^2} \cdot n \max_{1 \leq i \leq m} p_i (1 - p_i)} \quad \text{since } \sum_{j=1}^n \mathbf{1}_{\{X_j = i\}}\sim \text{Binomial}(n, p_i) \\ &\leq \sqrt{\frac{m}{4 n}} \quad \text{since } \max_{x\in [0,1]} x(1-x) = 1/4. \end{align*}

Thus, by the previous concentration bound, for $t > \sqrt{\frac{m}{4n}}$ we have \begin{align} \mathbb{P}\bigg(\Big|\|\hat{p}_n - p\|_2 - \mathbb{E}\Big[\|\hat{p}_n - p\|_2\Big]\Big| > t\bigg) &\leq \mathbb{P}\Big(\|\hat{p}_n - p\|_2 + \mathbb{E}\Big[\|\hat{p}_n - p\|_2\Big] > t\Big) \\ &\leq \mathbb{P}\Big(\|\hat{p}_n - p\|_2 > t - \sqrt{\frac{m}{4 n}}\Big) \\ &\leq 2 \exp\Bigg(- \frac{n \big(t - \sqrt{\frac{m}{4 n}}\big)^2}{2 m}\Bigg) \\ &\leq 2 \exp\Big(-\frac{n t^2}{2m}\Big) \exp\Big(+ \frac{t}{2} \sqrt{\frac{n}{m}}\Big), \end{align} where the last step expands the square and drops the constant factor $e^{-1/8} \leq 1$.

In particular, if $t \sqrt{n/m} \to \infty$ as $n\to \infty$ (i.e., $t$ dominates $\sqrt{m/n}$), the correction factor is absorbed: for any given $\epsilon > 0$ and $n$ large enough, \begin{align} \mathbb{P}\bigg(\Big|\|\hat{p}_n - p\|_2 - \mathbb{E}\Big[\|\hat{p}_n - p\|_2\Big]\Big| > t\bigg) &\leq 2 \exp\Big(-\frac{n t^2}{(2 + \epsilon) m}\Big). \end{align} This whole reasoning remains valid even if $m$ is a function of $n$.
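The intermediate bound $\mathbb{E}\big[\|\hat{p}_n - p\|_2\big] \leq \sqrt{m/(4n)}$ can also be checked by simulation. The sketch below (assuming NumPy; the Dirichlet-random choice of $p$ and the parameter values are illustrative) averages $\|\hat{p}_n - p\|_2$ over many trials:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, trials = 6, 400, 2000      # illustrative parameters
p = rng.dirichlet(np.ones(m))    # an arbitrary true distribution on {1, ..., m}

norms = np.empty(trials)
for k in range(trials):
    samples = rng.choice(m, size=n, p=p)
    p_hat = np.bincount(samples, minlength=m) / n
    norms[k] = np.linalg.norm(p_hat - p)  # ||p_hat - p||_2

mean_l2_error = norms.mean()
jensen_bound = np.sqrt(m / (4 * n))
```

Note that the sharper intermediate quantity $\sqrt{\sum_i p_i(1-p_i)/n} \leq \sqrt{1/n}$ is often much smaller than $\sqrt{m/(4n)}$, so the gap observed in simulation is expected.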