Can I draw conclusion based on the following Monte Carlo estimate of entropy?


I would like to compare the entropies of two discrete distributions over tokens. The two distributions over tokens $[x_1, x_2, \ldots, x_m]$ come from two different outputs of an autoregressive language model and are specified by the output probability vector $v = [p_1, p_2, \ldots, p_m]$. The plots below show Monte Carlo estimates of the entropies as a function of the number of samples $n$ drawn from each distribution, i.e., $\hat{H} = -\sum^{n}_{i=1} \frac{p(x_i)}{Z} \log \frac{p(x_i)}{Z}$, where $Z$ is the normalizing constant $\sum^{n}_{i=1} p(x_i)$:

[Plot: Monte Carlo estimate of the entropy of the first distribution]
[Plot: Monte Carlo estimate of the entropy of the second distribution]

The sampling is done with torch.multinomial, which is typically how it is done in this kind of use case. The reason I resort to estimation is that the sample space is too large for an analytical solution to be tractable. I understand the estimator is biased and, in general, underestimates the true entropy, but I think it may be asymptotically unbiased. Since the first plot consistently upper-bounds the second as the sample size increases, can I conclude that the first distribution has the larger entropy?
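For reference, here is a minimal sketch of how I compute the estimate (numpy stands in for torch to keep the snippet self-contained; `rng.choice` plays the same sampling role as `torch.multinomial`, and the Dirichlet vector is just a stand-in for a model's output distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

def h_hat(p, n, rng):
    """Self-normalized estimate -sum_i (p_i/Z) log(p_i/Z) over n sampled tokens."""
    idx = rng.choice(len(p), size=n, p=p)    # plays the role of torch.multinomial
    q = p[idx] / p[idx].sum()                # normalize over the drawn samples only
    return -np.sum(q * np.log(q))

p = rng.dirichlet(np.ones(10_000))           # stand-in for a model's output vector
est = h_hat(p, 1_000, rng)
print(est)
```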

Best answer:

To start with, observe that $\hat{H}_n$ does not converge to the entropy even as $n \to \infty$. Indeed, for a finite alphabet $\mathcal{X}$, if we define $N_n(x) = \#\{i : x_i = x\}$, then we can write $$\hat{H}_n = -\sum_{x \in \mathcal{X}} \frac{N_n(x)P(x)}{\sum_{x'} N_n(x')P(x')} \log \frac{P(x)}{\sum_{x'} N_n(x')P(x')}.$$ As $n \to \infty$, $N_n(x)/n \to P(x)$ almost surely. So, for large $n$, this quantity behaves like $$ -\sum_{x \in \mathcal{X}} \frac{P(x)^2}{\sum_{x'} P(x')^2} \log \frac{P(x)}{n\sum_{x'} P(x')^2} = \log n -\sum_x Q(x) \log\frac{Q(x)}{P(x)} = \log n - D(Q\|P),$$ where $Q$ is the distribution $Q(x) = P(x)^2/\sum_{x'} P(x')^2$. This has very little to do with the entropy as such, and it is doubly unclear why the quantity would be better behaved for finite $n$. Note also that the fact that your Monte Carlo estimate was clearly growing with $n$ (and by a fair bit) should have given you pause.
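To see this concretely, here is a quick numerical check (my sketch, using numpy and an arbitrary small alphabet) that $\hat{H}_n$ tracks $\log n - D(Q\|P)$ rather than the true entropy:

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(50))           # small alphabet, so the limit is easy to compute
H_true = -np.sum(P * np.log(P))          # the actual entropy of P
Q = P**2 / np.sum(P**2)
D_QP = np.sum(Q * np.log(Q / P))         # KL divergence D(Q || P)

n = 200_000
idx = rng.choice(len(P), size=n, p=P)
w = P[idx] / P[idx].sum()                # self-normalized weights from the question
H_hat = -np.sum(w * np.log(w))

print(H_hat, np.log(n) - D_QP, H_true)   # H_hat ~ log(n) - D(Q||P), far above H_true
```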


The usual Monte Carlo estimate of entropy under the setup you have described is much simpler (I'll call it $\hat{G}$ to differentiate it from the $\hat{H}$ defined above): $$ \hat{G}_n = -\frac{1}{n} \sum_{i \le n} \log P(x_i).$$ Note that this does converge to the entropy of $P$: indeed, $\hat{G}_n = -\sum_{x} (N_n(x)/n) \log P(x)$, and $N_n(x)/n \to P(x)$ for each $x$ almost surely. Of course, this is not quite enough, since you need to worry about the variance of the estimate. Doing so is relatively simple for finite-alphabet laws, but the real trouble with this estimate is that $-\log P(x)$ can be very large, which gets in the way of effective tail bounds: without strong assumptions, all you can do is set up Chebyshev-style bounds that are far too weak, especially for testing (and even more so for testing sequentially, as you appear to be doing).
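In code, $\hat{G}_n$ is essentially a one-liner (a numpy sketch with an arbitrary test distribution; here the exact entropy is computable, so we can watch the estimate converge to it):

```python
import numpy as np

rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(1_000))
H_true = -np.sum(P * np.log(P))          # exact entropy, available here for comparison

n = 100_000
idx = rng.choice(len(P), size=n, p=P)    # i.i.d. samples from P
G_hat = -np.mean(np.log(P[idx]))         # G_n = -(1/n) sum_i log P(x_i)

print(G_hat, H_true)                     # G_hat converges to H_true as n grows
```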

I'll propose a variant that works if you have a fixed error tolerance $\varepsilon$, so that you only need to estimate the entropy to within error $\varepsilon$ (with high probability). Here's the idea: let $m$ be a parameter to be set later, and define the estimate $$ \hat{F}_n(m) = -\frac{1}{n} \sum_{ i \le n} \log (\max(P(x_i), m)).$$ This introduces a bias, but one that can be controlled well. Indeed, the bias is $$H - \mathbb{E}[\hat{F}_n(m)] = \sum_{x : P(x) < m} P(x) \log\frac{m}{P(x)},$$ and since $\max_{u \in [0,m]} u \log(m/u) = m/e$ (attained at $u = m/e$), we get $$ 0 \le H - \mathbb{E}[\hat{F}_n(m)] \le \frac{|\mathcal{X}| m}{e}.$$ If we set $m = \varepsilon/(2e|\mathcal{X}|)$, then this gap is at most $\varepsilon/(2e^2) \le \varepsilon/2$.
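The bias bound is easy to check numerically; here is a sketch (with an arbitrary Dirichlet test distribution, which is my choice, not part of the answer) that computes the exact clipping bias and verifies it stays below both $|\mathcal{X}|m/e$ and $\varepsilon/2$:

```python
import numpy as np

rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(1_000))
H_true = -np.sum(P * np.log(P))

eps = 0.5
m = eps / (2 * np.e * len(P))            # the answer's choice m = eps / (2e|X|)

# Exact bias of clipping: H - E[F_n(m)] = sum_{x: P(x) < m} P(x) log(m / P(x))
bias = np.sum(np.where(P < m, P * np.log(m / np.maximum(P, 1e-300)), 0.0))

n = 50_000
idx = rng.choice(len(P), size=n, p=P)
F_hat = -np.mean(np.log(np.maximum(P[idx], m)))   # the clipped estimate

print(bias, F_hat, H_true)               # F_hat concentrates around H_true - bias
```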

Now we can exploit this as follows: the random variable $-\log \max(P(X), m)$ takes values in $[0, \log(1/m)]$, and therefore, by Hoeffding's inequality, for any $n$, $$ P( |\hat{F}_n(m) - \mathbb{E}[\hat{F}_n(m)]| > t) \le 2\exp\left( - \frac{2 nt^2}{\log^2 m}\right).$$ Further, we can use laws of the iterated logarithm to allow sequential estimation. For instance, using this paper (I'm using weaker but neater constants here; see the paper for tighter bounds), if we define $$ \lambda(n,m,\delta) = 2|\log m| \sqrt{\frac{\log\max(1,\log(2n)) + \log(11/\delta)}{n}},$$ then $$ P\left(\exists\, n\ge 1: |\hat{F}_n(m) - \mathbb{E}[\hat{F}_n(m)]| > \lambda(n,m,\delta) \right) < \delta.$$
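A direct transcription of this width, as a sketch (the constants follow the display above, not the sharper ones in the paper; $m$ and $\delta$ are arbitrary illustrative values):

```python
import numpy as np

def lam(n, m, delta):
    # time-uniform confidence width lambda(n, m, delta) defined above
    return 2 * abs(np.log(m)) * np.sqrt(
        (np.log(max(1.0, np.log(2 * n))) + np.log(11 / delta)) / n
    )

m, delta = 1e-6, 0.05
widths = {n: lam(n, m, delta) for n in (100, 10_000, 1_000_000)}
for n, w in widths.items():
    print(n, w)                  # shrinks roughly like sqrt(log log n / n)
```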

This sets up a concrete way to test whether the entropy of one law is larger than that of another, up to a tolerance. Suppose you have access to two distributions $P_1, P_2$ in the same way, and you want to test whether $H(P_1) > H(P_2)$ or vice versa. I'll assume that you have an upper bound on $|\mathcal{X}|$, a tolerance $\varepsilon$ (so that if $|H(P_1) - H(P_2)| < \varepsilon$, the outcome of the test doesn't matter), and a permitted probability of error $\delta$. Then run the following procedure:

  • First, set $m = \varepsilon/(2e|\mathcal{X}|)$.

  • For each $n$, sample $X_n \sim P_1$ and $Y_n \sim P_2$, and compute the estimates $$\hat{F}_n^1(m) = -\frac{1}{n}\sum_{i\le n} \log \max(m,P_1(X_i)), \qquad \hat{F}_n^2(m) = -\frac{1}{n}\sum_{i \le n} \log\max(m,P_2(Y_i)).$$

  • If at any round $n$, $$|\hat{F}_n^1(m) - \hat{F}_n^2(m)| > \varepsilon/2 + 2\lambda(n,m,\delta), $$ then stop and declare that $P_1$ has the higher entropy if $\hat{F}_n^1(m) > \hat{F}_n^2(m)$, and that $P_2$ has the higher entropy otherwise.

  • Finally, if $n > \min\{ i : \varepsilon/2 > 2\lambda(i,m,\delta)\}$, then stop (this time can be precomputed, and is roughly $\frac{\log^2 m}{\varepsilon^2} \log\frac{1}{\delta}$, up to $\log\log$ factors).
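Putting the whole procedure together, a sketch in Python might look like this (the particular pair of distributions, the alphabet size, and the `n_max` sampling cap are my illustrative assumptions, not part of the answer):

```python
import numpy as np

def lam(n, m, delta):
    # time-uniform confidence width lambda(n, m, delta) from the answer
    return 2 * abs(np.log(m)) * np.sqrt(
        (np.log(max(1.0, np.log(2 * n))) + np.log(11 / delta)) / n
    )

def sequential_entropy_test(P1, P2, eps, delta, rng, n_max=100_000):
    """Returns 1 or 2 for the law declared to have the higher entropy,
    or 0 if the eps-tolerance stopping time is reached first."""
    m = eps / (2 * np.e * max(len(P1), len(P2)))
    xs = rng.choice(len(P1), size=n_max, p=P1)   # presampled for speed
    ys = rng.choice(len(P2), size=n_max, p=P2)
    s1 = s2 = 0.0
    for n in range(1, n_max + 1):
        s1 -= np.log(max(P1[xs[n - 1]], m))      # running clipped log-loss sums
        s2 -= np.log(max(P2[ys[n - 1]], m))
        f1, f2 = s1 / n, s2 / n
        w = lam(n, m, delta)
        if abs(f1 - f2) > eps / 2 + 2 * w:       # confidence bands have separated
            return 1 if f1 > f2 else 2
        if eps / 2 > 2 * w:                      # precomputable time limit reached
            return 0
    return 0

rng = np.random.default_rng(0)
P1 = np.ones(64) / 64                  # uniform: entropy log(64) ~ 4.16 nats
P2 = 0.5 ** np.arange(1, 65)
P2 = P2 / P2.sum()                     # near-geometric: entropy ~ 1.39 nats
result = sequential_entropy_test(P1, P2, eps=0.5, delta=0.05, rng=rng)
print(result)                          # declares P1 the higher-entropy law
```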

If $\Delta H = |H(P_1) - H(P_2)| > \varepsilon$, the above test can be shown to stop after roughly $\frac{\log^2 m}{(\Delta H)^2}$ samples (up to $\log\log$ terms) and to be correct, with probability at least $1-\delta$. With the remaining probability $\delta$, it may stop and identify the wrong distribution as having the higher entropy, or it may continue until the precomputed sample limit of order $\log^2(m)/\varepsilon^2$. In the adaptive-stopping case, the $(\Delta H)^{-2}$ dependence should be unavoidable. I don't know whether the dependence on $\log m$ is necessary, but since $m \sim \varepsilon/|\mathcal{X}|$, we have $|\log m| \sim \log(|\mathcal{X}|/\varepsilon)$, which is not unreasonable. For instance, even with the ridiculous values $|\mathcal{X}| = 10^{50}$ and $\varepsilon = 10^{-10}$, this still works out to only about $60 \log 10 \approx 140$, which is quite manageable. It may be possible to reduce this further by setting $m$ adaptively as more samples are drawn.

Further, this test may be wasteful for another reason: under the hood, it computes confidence bounds for $H(P_1)$ and $H(P_2)$ separately, and stops when these separate. It may instead be possible to build confidence bounds directly for $H(P_1) - H(P_2)$, for instance by studying clipped variants of $$D_n = \frac{1}{n} \sum_{i \le n} \left(\log P_1(X_i) - \frac{P_2(X_i)}{P_1(X_i)} \log P_2(X_i)\right),$$ where $X_i \sim P_1$ throughout, so that $\mathbb{E}[D_n] = H(P_2) - H(P_1)$. I'll leave it to you to explore such possibilities.
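As a quick sanity check of that suggestion (my sketch, with an arbitrary pair of laws; taking $P_1$ uniform keeps the importance weights $P_2/P_1$ bounded, which keeps the variance tame), the unclipped $D_n$ does concentrate around $H(P_2) - H(P_1)$:

```python
import numpy as np

rng = np.random.default_rng(5)
K = 200
P1 = np.ones(K) / K                      # sampling law: X_i ~ P1 throughout
P2 = rng.dirichlet(np.ones(K))           # comparison law
H1 = -np.sum(P1 * np.log(P1))
H2 = -np.sum(P2 * np.log(P2))

n = 200_000
x = rng.choice(K, size=n, p=P1)
# D_n = (1/n) sum_i [ log P1(X_i) - (P2(X_i)/P1(X_i)) log P2(X_i) ]
D_n = np.mean(np.log(P1[x]) - (P2[x] / P1[x]) * np.log(P2[x]))

print(D_n, H2 - H1)                      # D_n estimates H(P2) - H(P1) directly
```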