How to reconcile "conditioning reduces entropy" with certain phenomena in Bayesian inference


From information theory we know that conditioning a random variable $X$ on $Y$ does not increase its entropy (i.e., $H(X|Y) \leq H(X)$).

However, in Bayesian inference, we know that if a prior is strong (i.e., sharply peaked) but wrong (i.e., peaked far from the true parameter), then after observing data the posterior will first spread out and only gradually concentrate at the correct position.

How can these two be consistent? (i.e., when the posterior first spreads out, how can the entropy still be non-increasing?)

Best answer:

The conditional entropy is an average: $H(X|Y) = \sum_y P(Y = y)\, H(X|Y = y)$. Moreover, the law of $X$ implicit in this expression (which in turn induces the law of $Y$) is exactly the prior. So if $X$ takes an atypical value, this can put mass on $y$s for which $H(X|Y = y)$ is large; but the typical values of $X$ put mass on $y$s for which $H(X|Y = y)$ is small, in such a way that the average works out to be smaller than the prior entropy. In other words, if you ran many independent experiments, each drawing an $X_i$ according to the prior, sampling the induced $Y_i$, and computing the posterior, then most of the time the pointwise conditional entropy $H(X|Y = y_i)$ would be smaller than $H(X)$. This should make sense even when the prior is peaked: since you are drawing from that prior itself, most of the time you get values of $X$ in the peaked region anyway.
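The repeated-experiment argument can be checked numerically. The sketch below (my own illustration, using a Bernoulli prior with a symmetric bit-flip observation channel as an assumed setup) draws $X_i$ from the prior, samples $Y_i$, and records the pointwise posterior entropy $H(X|Y = y_i)$; most runs land below $H(X)$, and the average does too.

```python
import random
from math import log2

def h(p):
    """Binary entropy in bits, with 0*log(0) := 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

random.seed(0)
p_x1, p_flip = 0.99, 0.1  # prior X ~ Bern(0.99); observation Y = X xor Z, Z ~ Bern(0.1)

# Pointwise posterior entropies H(X | Y = y), computed once via Bayes' rule
p_y1 = p_x1 * (1 - p_flip) + (1 - p_x1) * p_flip
post_entropy = {
    1: h(p_x1 * (1 - p_flip) / p_y1),
    0: h(p_x1 * p_flip / (1 - p_y1)),
}

# Repeated experiment: draw X from the prior, sample Y, record H(X | Y = y)
n = 100_000
entropies = []
for _ in range(n):
    x = 1 if random.random() < p_x1 else 0
    z = 1 if random.random() < p_flip else 0
    entropies.append(post_entropy[x ^ z])

avg = sum(entropies) / n
frac_smaller = sum(e < h(p_x1) for e in entropies) / n
print(f"H(X) = {h(p_x1):.3f}, average pointwise posterior entropy = {avg:.3f}")
print(f"fraction of runs with H(X|Y=y) < H(X): {frac_smaller:.3f}")
```

The fraction of runs with reduced entropy is roughly $P(Y = 1)$, since $Y = 1$ is the outcome whose posterior is sharper than the prior.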

Perhaps a concrete example helps illustrate this. Say $X \sim \mathrm{Bern}(0.99)$, and $Y = X \oplus Z$ for $Z \sim \mathrm{Bern}(0.1)$ independent of $X$. Then we have $$ P(Y = 1) = 0.99 \cdot 0.9 + 0.01 \cdot 0.1 = 0.892.$$

Further, $$ P(X = 1|Y = 1) \approx 0.999, \qquad P(X = 1|Y = 0) \approx 0.92.$$ Now, it's certainly the case that $$0.41 \approx H(X|Y = 0) \gg H(X) \approx 0.08,$$ and you're asking how that is consistent with $H(X|Y) \le H(X)$. But notice that $H(X|Y = 1) \approx 0.013$ is much smaller than $H(X)$, and $P(Y = 1) \gg P(Y = 0)$, so the average is $$H(X|Y) \approx 0.892 \cdot 0.013 + 0.108 \cdot 0.41 \approx 0.056 < H(X),$$ and there's no contradiction.
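For the skeptical reader, the numbers above can be reproduced exactly from Bayes' rule (a short self-contained check; variable names are my own):

```python
from math import log2

def h(p):
    """Binary entropy in bits, with 0*log(0) := 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

# Prior X ~ Bern(0.99); channel Y = X xor Z with Z ~ Bern(0.1) independent of X
p_x1, p_flip = 0.99, 0.1

p_y1 = p_x1 * (1 - p_flip) + (1 - p_x1) * p_flip  # = 0.892
p_y0 = 1 - p_y1                                   # = 0.108

# Posteriors by Bayes' rule
p_x1_given_y1 = p_x1 * (1 - p_flip) / p_y1        # ~ 0.999
p_x1_given_y0 = p_x1 * p_flip / p_y0              # ~ 0.92

h_prior   = h(p_x1)           # H(X)       ~ 0.08
h_post_y1 = h(p_x1_given_y1)  # H(X|Y=1)   ~ 0.013
h_post_y0 = h(p_x1_given_y0)  # H(X|Y=0)   ~ 0.41  (larger than the prior entropy!)

h_cond = p_y1 * h_post_y1 + p_y0 * h_post_y0      # H(X|Y) ~ 0.056 < H(X)
print(h_prior, h_post_y1, h_post_y0, h_cond)
```

The pointwise entropy $H(X|Y = 0)$ exceeds $H(X)$, but it is weighted by the small probability $P(Y = 0)$, so the average still drops below the prior entropy.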