Intuition for Local vs. Global notions of Metric Entropy in Statistics


I am looking for intuition regarding the following statement on page 4 of this paper by Gassiat and Van Handel:

However, in finite dimensional settings, global entropy bounds are known to yield sub-optimal results, and here local entropy bounds are essential to obtain optimal convergence rates of estimators.

The paper itself and the cited references all establish this claim by showing that global entropy bounds yield suboptimal rates, and then showing that the localized approach achieves the optimal (or at least better) rates. From that perspective the statement is clear, and I suppose it is sufficient justification for the claim. What I am looking for, though, is some intuition for why this happens to be the case. What is it (from an intuitive perspective) about localized analyses that makes them sharper? How should one think about localized notions of dimension as opposed to global ones? I'm also interested in how other branches of mathematics (analysis, geometry, etc.) approach this question.

Some background:

For a subset $T$ of a metric space $(X,d)$, the covering number $N(T,\epsilon)$ is the smallest possible number of $\epsilon$-balls that can cover $T$. Formally,

$$ N(T,\epsilon) := \inf \{ n: \exists \{x_i\}_{i=1}^n \subset X: T \subset \bigcup_{i=1}^n B_{x_i}(\epsilon) \} $$

where $B_{x_i}(\epsilon)$ is the ball (in the metric $d$) that is centered at $x_i$ and has radius $\epsilon$. The metric entropy is $\log N(T, \epsilon)$.
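As a concrete sanity check, for an interval of length $L$ on the real line each $\epsilon$-ball (an interval of radius $\epsilon$) covers length $2\epsilon$, so $N([0,L],\epsilon) = \lceil L/(2\epsilon) \rceil$ and the metric entropy grows like $\log(1/\epsilon)$. A minimal sketch (the function name is mine):

```python
import math

def covering_number_interval(length, eps):
    """Minimal number of eps-balls (intervals of radius eps, so each
    covering length 2*eps) needed to cover an interval of given length."""
    return math.ceil(length / (2 * eps))

# Metric entropy log N(T, eps) grows like log(1/eps) for an interval:
for eps in [0.5, 0.1, 0.01]:
    n = covering_number_interval(1.0, eps)
    print(f"eps={eps}: N={n}, entropy={math.log(n):.3f}")
```

The same volume-counting heuristic gives $N \asymp (1/\epsilon)^d$ for a bounded set in $\mathbb{R}^d$, so metric entropy scales linearly in the dimension.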

In a statistical setting, covering numbers are useful for establishing uniform laws. For example, suppose $X, X_1,\dots, X_n$ are i.i.d. random vectors and $F$ is a class of functions. Results of the following flavor are common:

$$ \mathbb{P} \left( \sup_{f \in F} \left | \frac{1}{n} \sum_{i=1}^n f(X_i) - \mathbb{E} [f(X)] \right | > \epsilon \right) \le g( N(F, \epsilon'), ...), $$ where $g$ is some function of the covering number and possibly other parameters of the problem. The result basically says that if the function class $F$ is not too large/complex (as measured by the covering number), then with high probability (over the sample), the worst-case fluctuation of an empirical average from its true mean is not too large. Other related notions here are the Rademacher complexity and the VC dimension, or, if $F$ is finite, we can simply use the cardinality of $F$.
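In the simplest case, when $F$ is finite, Hoeffding's inequality plus a union bound gives $g = 2|F|\exp(-2n\epsilon^2)$, which is the finite-class prototype of the covering-number bound above. A minimal simulation, assuming a class of threshold functions $f_t(x) = \mathbf{1}\{x \le t\}$ on $\mathrm{Uniform}(0,1)$ data (my choice of example, not from the paper):

```python
import random, math

# Finite class F = {f_t : t in thresholds}, f_t(x) = 1{x <= t}.
# For X ~ Uniform(0,1) we have E[f_t(X)] = t, so the empirical deviation
# is just |F_n(t) - t| for the empirical CDF F_n.
random.seed(0)
n, eps = 2000, 0.05
thresholds = [i / 10 for i in range(1, 10)]
xs = [random.random() for _ in range(n)]

# Worst-case fluctuation of the empirical average over the class:
sup_dev = max(abs(sum(x <= t for x in xs) / n - t) for t in thresholds)

# Hoeffding + union bound: P(sup_f |...| > eps) <= 2 |F| exp(-2 n eps^2).
union_bound = 2 * len(thresholds) * math.exp(-2 * n * eps ** 2)
print(f"sup deviation = {sup_dev:.4f}, union bound on tail = {union_bound:.2e}")
```

For infinite $F$, the union bound over a finite $\epsilon'$-cover replaces $|F|$ by $N(F,\epsilon')$, which is how the covering number enters $g$.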

Localization:

A localized covering number is given by $$ N(T \cap B_{t_0} (\delta), \epsilon), $$ where $t_0 \in T$ is a fixed point. Generally, $t_0$ is taken to be the 'true' parameter/function that generated the data, say $f^*$, and so the RHS of the above bound becomes something like $$ g'( N(F \cap B_{f^*} (\delta), \epsilon), ...) $$ for some different function $g'$. This generally yields better (smaller) upper bounds than the non-localized analyses. This style of analysis has become popular in the statistics literature over the last 15-20 years; for example, this paper uses a localized version of the Rademacher complexity.
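To see numerically where the gain comes from, take $T$ to be the unit ball in $\mathbb{R}^d$ and use the standard volume-ratio estimate $\log N(B(x,r),\epsilon) \approx d \log(r/\epsilon)$. Globally the entropy grows like $d\log(1/\epsilon)$, but a localized ball of radius $\delta \propto \epsilon$ around $f^*$ has entropy about $d\log(\delta/\epsilon)$, a constant multiple of $d$, independent of $\epsilon$. A back-of-the-envelope sketch under this volume-ratio assumption:

```python
import math

d = 10  # ambient dimension (illustrative choice)

def global_entropy(eps):
    """Volume-ratio estimate: log N(B(0,1), eps) ~ d * log(1/eps)."""
    return d * math.log(1 / eps)

def local_entropy(delta, eps):
    """Localized estimate: log N(B(f*, delta), eps) ~ d * log(delta/eps)."""
    return d * math.log(delta / eps)

eps = 1e-3
print(f"global entropy at eps={eps}: {global_entropy(eps):.2f}")
# With delta proportional to eps, the local entropy is d*log(2),
# bounded independently of eps:
print(f"local entropy at delta=2*eps: {local_entropy(2 * eps, eps):.2f}")
```

Heuristically, this is why localization matters in finite dimensions: balancing $n\epsilon^2$ against the local entropy $\asymp d$ gives the parametric rate $\epsilon_n \asymp \sqrt{d/n}$, whereas balancing against the global entropy $d\log(1/\epsilon)$ picks up a spurious logarithmic factor.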