Convergence of empirical means in other norms


Say that we have a sequence $x_1,x_2,\dots$ of i.i.d. random vectors in $\mathbb{R}^n$ with mean $0$ and variance $\sigma^2$, meaning $$\mathbb{E}[\|x_i\|^2_2] = \sigma^2$$ for all $i$. Then it's a pretty standard exercise (expand the square; by independence and zero mean the cross terms vanish) that the variance of the empirical averages tends to zero: $$\mathbb{E}\left[\left\|\frac{1}{N}\sum\limits_{i=1}^{N}x_i\right\|_2^2\right] = \frac{\sigma^2}{N}$$ What happens if I replace the Euclidean norm $\|\cdot\|_2$, in both the definition of $\sigma$ and in the norm of the average, with something else, for instance $\|\cdot\|_p$ with $p=1$ or $p=\infty$? Can I still obtain a bound, something like

$$\mathbb{E}\left[\left\|\frac{1}{N}\sum\limits_{i=1}^{N}x_i\right\|_p^2\right] \leq K\frac{\sigma^2}{N}$$ for an appropriate $K$? I know this is possible by invoking the equivalence of any norm on $\mathbb{R}^n$ to the Euclidean norm, but that approach introduces factors that depend on the dimension. My question: is it possible to obtain a similar bound on the variance of the empirical averages that is independent of the dimension?
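For a rough numerical sense of the question, here is a minimal Monte Carlo sketch (the standard Gaussian test case, the parameters `n`, `N`, `trials`, and the helper `mean_norm_sq` are my own choices for illustration, not part of the question):

```python
# Minimal Monte Carlo sketch (assumed test case: i.i.d. standard Gaussian
# coordinates, so sigma^2 = E[||x_i||_2^2] = n). Compares the Monte Carlo
# estimate of E[||(1/N) sum_i x_i||_p^2] against sigma^2 / N for a few p.
import numpy as np

rng = np.random.default_rng(0)

def mean_norm_sq(n, N, p, trials=2000):
    # (trials, N, n) i.i.d. N(0,1) samples, averaged over the N axis.
    means = rng.standard_normal((trials, N, n)).mean(axis=1)
    return (np.linalg.norm(means, ord=p, axis=1) ** 2).mean()

n, N = 50, 100
sigma_sq = n  # E[||x_i||_2^2] = n in this test case
for p in (2, 1, np.inf):
    print(f"p={p}: E||mean||_p^2 ~ {mean_norm_sq(n, N, p):.3f}, "
          f"sigma^2/N = {sigma_sq / N:.3f}")
```

In this test case the $p=2$ estimate matches $\sigma^2/N$, while the $p=1$ estimate is far larger, which already hints at a dimension dependence.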


Best answer:

This is false, at least for $\ell_1$. I'm going to use the following result: learning an arbitrary probability distribution $p$ over $\{1,\dots,n\}$ to total variation distance ($\ell_1$ distance between probability mass functions), say $0.01$, requires $\Theta(n)$ i.i.d. samples from $p$.

Now, view $p$ as a vector in $\mathbb{R}^n$ (a vector of non-negative values summing to $1$). If you let $\mu$ be the vector $\mu=(p(1),\dots,p(n))$, then learning $p$ in total variation distance is equivalent to your problem: you want to learn $\mu$ in $\ell_1$ distance, and you get i.i.d. vectors in $\mathbb{R}^n$ with mean $\mu$ and $\sigma^2 \leq 1$ (each sample has exactly one non-zero coordinate, equal to one, so $\mathbb{E}[\|x_i\|_2^2] = 1$; subtracting $\mu$ centers the vectors to match your mean-zero setup and only decreases the second moment).

This means that, at least for $\ell_1$, your $K$ has to depend on $n$; otherwise you could learn $\mu$ with a number of samples independent of $n$.

To summarize: the (minimax) rate of learning a discrete probability distribution over $n$ elements in total variation (TV) distance has a linear dependence on $n$ (see, e.g., [1]). Given $N$ i.i.d. samples from such a distribution $p$, you can simulate $N$ i.i.d. vectors in $\mathbb{R}^n$ (just by mapping each sample to its indicator vector), and learning the mean of those vectors in $\ell_1$ is equivalent to learning the distribution $p$ in TV distance. Therefore, the rate of the latter question must have at least a linear dependence on $n$ as well (otherwise one could perform the former task too efficiently).
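Here is a small simulation sketch of this reduction (the uniform distribution over $\{1,\dots,n\}$ and the parameters below are an assumed example for illustration, not the hard instance from [1]): for fixed $N$, the expected $\ell_1$ error of the empirical mean grows with $n$ instead of staying of order $\sqrt{\sigma^2/N}$.

```python
# Sketch of the reduction (assumed example: p uniform over {1,...,n},
# so mu = (1/n,...,1/n); each sample is an indicator vector with
# E[||x_i||_2^2] = 1). For fixed N, the l1 error of the empirical mean
# does not shrink as n grows.
import numpy as np

rng = np.random.default_rng(0)

def l1_error(n, N, trials=500):
    mu = np.full(n, 1.0 / n)
    errs = []
    for _ in range(trials):
        # Empirical mean of N indicator vectors = normalized histogram.
        counts = np.bincount(rng.integers(0, n, size=N), minlength=n)
        errs.append(np.abs(counts / N - mu).sum())
    return float(np.mean(errs))

N = 100
for n in (10, 100, 1000):
    print(f"n={n:4d}: E||empirical mean - mu||_1 ~ {l1_error(n, N):.3f}")
```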


[1] C. Canonne, *A short note on learning discrete distributions*, https://arxiv.org/abs/2002.11457 (a reference for some of these learning/density estimation "folklore" facts).

Another answer:

To quote the above answer: "This is true, at least for $\ell_p$, $p \geq 2$" :-)

This is simply because on $\mathbb R^n$ (or more generally $\mathbb C^n$), if $0<p<q$,

$$\|x\|_{q}\le \|x\|_{p},$$

a surprisingly simple inequality a colleague taught me last week (to see it, normalize so that $\|x\|_p = 1$; then each $|x_i| \leq 1$, so $|x_i|^q \leq |x_i|^p$, and summing gives $\|x\|_q^q \leq 1$).

On the other hand, the reverse inequality (by Jensen or Hölder; thanks Clement C.)

$$\|x\|_{p}\le n^{1/p-1/q} \|x\|_q$$

gives you a bound for $p \leq 2$ (taking $q = 2$ and using the $\ell_2$ identity above):

$$\mathbb{E}\left[\left\|\frac{1}{N}\sum\limits_{i=1}^{N}x_i\right\|_p^2\right] \leq \sigma^2 \frac{n^{2/p-1}}{N} $$

that indeed depends on the dimension.
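As a quick numerical check of this bound for $p=1$ (using the same assumed Gaussian test case as the sketch in the question, so $\sigma^2 = n$; an illustration, not a proof):

```python
# Check (not a proof) of E[||mean||_1^2] <= sigma^2 * n^(2/p - 1) / N
# for p = 1, with standard Gaussian samples so sigma^2 = E[||x_i||_2^2] = n.
import numpy as np

rng = np.random.default_rng(0)
n, N, trials = 50, 100, 2000

means = rng.standard_normal((trials, N, n)).mean(axis=1)
lhs = (np.abs(means).sum(axis=1) ** 2).mean()  # E[||mean||_1^2]
bound = n * n ** (2 / 1 - 1) / N               # sigma^2 * n^(2/p-1) / N
print(f"E||mean||_1^2 ~ {lhs:.2f} <= bound = {bound:.2f}")
```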

EDIT: I noticed I did not change the definition of $\sigma$ as requested by the OP ($\sigma$ above is still the $\ell_2$ variance). Hence not a fully valid answer.