Why can we define $0\log 0=0$ and $0\log 0/0=0$ if this is not true for every path leading to $0^+$?


Very often in information theory literature we see these two definitions (e.g., Aczel, J. and Daroczy, Z., 1975; Ebanks, Sahoo and Sanders, 1997)

$$ 0 \log 0 = 0,$$ $$ 0 \log \frac{0}{0} = 0.$$

I know that along the path $q=p$ we have $\lim_{p\rightarrow0^+} p\log\frac{p}{p} = 0$; however, I recently became aware that the limit does not exist along every path leading to $0^+$:

https://cstheory.stackexchange.com/questions/46406/when-is-it-necessary-to-define-0-log-0-0-and-0-log-0-0-0

e.g.,

$$\lim_{p\rightarrow0} p\log\frac{p}{e^{-\frac{1}{p}}} = 1.$$
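To see where the $1$ comes from, note that $p\log\frac{p}{e^{-1/p}} = p\log p + p\cdot\frac{1}{p} = p\log p + 1 \to 1$. A quick numerical sanity check (my own illustration, not part of the original question; the simplified form is evaluated because $e^{-1/p}$ underflows in floating point for small $p$):

```python
import math

# p * log(p / exp(-1/p)) = p*log(p) + 1  (natural log).
# Evaluating the simplified right-hand side avoids underflow of exp(-1/p).
for p in [1e-1, 1e-2, 1e-4, 1e-8]:
    print(f"p = {p:g}:  p*log(p/e^(-1/p)) = {p * math.log(p) + 1.0:.8f}")
```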

In this case, why are we allowed to make this definition, and what are the consequences to the proofs using this definition?

There are 2 answers below.
As you say, the second "definition", $0 \log \frac{0}{0} = 0$, is not justified if we interpret it in the sense $\lim_{x,y \to 0} x \log (x/y) = 0$. That limit statement is false, and you have provided an apt counterexample.

More concretely, let $p_{i,n}$ and $q_{i,n}$ be two families of distributions over $i = 1,2,3$, indexed by $n=1,2,\dots$, such that $$p_{1,n}=2^{-n}, \quad p_{2,n}=p_{3,n}=\frac12 (1-p_{1,n}) $$ $$q_{1,n}=p_{1,n} \exp(-1/p_{1,n}), \quad q_{2,n}=q_{3,n}=\frac12 (1-q_{1,n}) $$

Then $p_{1,n}\to 0$ and $q_{1,n}\to 0$, but $p_{1,n} \log(p_{1,n}/q_{1,n})=1$, and in the sum $$ \sum_i p_{i,n} \log\frac{p_{i,n}}{q_{i,n}}$$ the first term is the only one that does not tend to zero.
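A quick numerical check of this construction (my own sketch, not part of the original answer). Since $q_{1,n}=p_{1,n}e^{-1/p_{1,n}}$, we have $\log(p_{1,n}/q_{1,n}) = 1/p_{1,n}$, which is used analytically below because $e^{-1/p_{1,n}}$ underflows in floating point already for moderate $n$:

```python
import math

def kl_terms(n):
    """Per-term contributions p_i * log(p_i / q_i) for the families above (natural log)."""
    p1 = 2.0 ** (-n)
    p2 = 0.5 * (1.0 - p1)               # = p3
    q1 = p1 * math.exp(-1.0 / p1)       # underflows harmlessly to 0.0 for large n
    q2 = 0.5 * (1.0 - q1)               # = q3
    t1 = p1 * (1.0 / p1)                # log(p1/q1) = 1/p1, so this term is exactly 1
    t23 = 2.0 * p2 * math.log(p2 / q2)  # the two remaining (equal) terms
    return t1, t23

for n in (2, 5, 10, 20):
    t1, t23 = kl_terms(n)
    print(f"n={n:2d}: first term = {t1}, rest = {t23:+.6f}, sum = {t1 + t23:.6f}")
```

As $n$ grows the full sum approaches $1$, not $0$, even though both families converge to the same limiting pmf $(0,\frac12,\frac12)$.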


However, those textbooks are not wrong, because the definitions are not really about limits. They are about probability functions that can be zero for some values of their domain.

Consider these two pmfs (over the domains $\{1,2\}$ and $\{1,2,3\}$ respectively): $p_A=(\frac12,\frac12)$ and $p_B=(\frac12,\frac12,0)$. These pmfs are formally different. However, we consider them totally equivalent for all practical purposes, and it would be quite weird if some statistical measure (some functional of a pmf) gave different results for each of them. In particular, we strongly want that for the entropies: $H(A)=H(B)$. For that, the definition $H(X)=-\sum_x p(x) \log p(x)$ must come attached with the convention $0 \log 0=0$. (The alternative would be restricting the sum to $x\in D : p(x)>0$, but this would be clumsy.)
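In code the two alternatives coincide: skipping the zero-probability outcomes is exactly how one implements $0\log 0 = 0$. A minimal sketch of my own (log base 2, so entropy is in bits):

```python
import math

def entropy(pmf):
    """Shannon entropy in bits; skipping p == 0 implements the convention 0*log(0) = 0."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

p_A = (0.5, 0.5)
p_B = (0.5, 0.5, 0.0)
print(entropy(p_A), entropy(p_B))  # both equal 1.0
```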

So you see, it's not about limits. True, $\lim_{x\to 0^+} x \log x=0$, but that's not the point. The fact that the limit holds merely tells us that the above definition makes the entropy continuous, so if we have a sequence of pmfs with $\lim p_n = p$, then $\lim H(p_n) = H(\lim p_n) = H(p)$. A nice result, of course, but not essential.

Turning to the other "definition", $0 \log (0/0)=0$: this one comes attached with the KL-divergence definition $D(p\|q)=\sum_x p(x) \log(p(x)/q(x))$, for the very same reasons. We don't want the KL divergence to change (or become undefined) if we trivially extend the (common) domain with points where $p$ and $q$ are both zero.

Now, as we noted above, the limit does not necessarily agree with that definition. But this only implies that the KL divergence is not continuous, strictly speaking. It can happen (see the example above) that $(p_n,q_n) \to (p,q)$ but $\lim D(p_n \| q_n) \ne D(p \| q)$. This is rather unfortunate, but not tragic and practically inconsequential.


This started as a mild counterpoint to leonbloy which got too long for a comment on their answer. Thanks to them, and to OP, for making me think about this!


I wouldn't say that the issue you raise is entirely inconsequential. The examples you construct with a non-zero limit are, essentially, getting at the phenomenon that weak convergence of distributions is a strictly weaker notion than convergence in the KL sense. We know this is true: in general, $D$ is only lower semicontinuous under weak convergence. As an egregious example, take $Q$ to be any continuous distribution, and $P_n$ to be the empirical measure generated from $n$ iid samples from $Q$. Then $D(Q\|P_n) = D(P_n\|Q) = \infty$ for every $n$, even though $P_n \Rightarrow Q$. This has the very practical (and natural) consequence that if one is interested in density estimation under metrics like KL, they had better smooth their naive estimates with some continuous kernel (the same is true if you're chasing, say, estimation under $L_1$).

However, the context in which the above conventions are made is important. These are typically the sort of settings where one doesn't worry about evolving distributions, or where they evolve in some simple sense: say, I care about $D(p^{\otimes n} \| q^{\otimes n})$, and the distributions all happen to be discrete, or all happen to have densities. In these cases all the worrying about edge cases is not really needed, and the convention adopted is an easy shorthand.

A lot of these issues vanish with the 'correct' definition of KL divergence. For generic probability measures, $$ D(P\|Q) := \begin{cases} \infty & P\not\ll Q \\ \mathbb{E}_Q[ \frac{\mathrm{d}P}{\mathrm{d}Q}(X) \log \frac{\mathrm{d}P}{\mathrm{d}Q}(X) ] & P \ll Q\end{cases}.$$

Now, since $Q(A) = 0 \implies \int_A f(x) Q(\mathrm{d}x) = 0 $ for all measurable $A$ and any measurable $f$ (which is a consequence of defining integrals via suprema of integrals of simple functions), and since the $x$s with $\mathrm{d}P/\mathrm{d}Q(x) = 0$ are handled via continuity of $u\log u$, the correct - in my opinion - specialization to a discrete context is $$ D(P\|Q) = \begin{cases} \infty & \exists x: p_x > 0 = q_x \\ \sum_{x:q_x > 0} q_x \frac{p_x}{q_x} \log \frac{p_x}{q_x} & \forall x\, q_x = 0 \implies p_x = 0 \end{cases}.$$
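A direct transcription of this case-based definition (my own sketch, for finite pmfs given as equal-length sequences, in nats):

```python
import math

def kl(p, q):
    """Case-based discrete KL divergence D(p||q) in nats."""
    if any(px > 0 and qx == 0 for px, qx in zip(p, q)):
        return math.inf                  # P is not absolutely continuous w.r.t. Q
    # Terms with px == 0 contribute 0 (continuity of u*log(u) at 0), so summing
    # over px > 0 matches the sum over q_x > 0 in the displayed formula.
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

print(kl((0.5, 0.5, 0.0), (0.25, 0.75, 0.0)))  # finite: both assign 0 to the third outcome
print(kl((0.5, 0.5), (1.0, 0.0)))              # inf: p puts mass where q doesn't
```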

This is a mouthful. Can we simplify? Sure.

First, in the second case of the definition, if we cancel the $q_x$s and then adopt the convention $0\log (0/0) = 0$, we can extend the summation to all $x$. Note that this is not justified by continuity; it's a formal convention, applied only when both $p_x$ and $q_x$ are zero, and it is technically handling the $Q(A) = 0$ case above. Next, we can also fold in the first of the two cases by allowing $p$ to be arbitrary and adopting the convention $u \log (u/0) = \infty$ for $u > 0$, which makes intuitive sense. Now, as argued by leonbloy, things don't change much, the correct intuitions are captured, and you have a formula you can state to an undergrad without a bunch of technicalities around it making it seem more complicated :)
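The simplified single-sum formula, with both conventions folded in, might look like this (again my own sketch; it agrees with the case-based definition above):

```python
import math

def term(p, q):
    """p * log(p/q) with the conventions 0*log(0/q) = 0*log(0/0) = 0
    and u*log(u/0) = +inf for u > 0 (natural log)."""
    if p == 0:
        return 0.0
    if q == 0:
        return math.inf
    return p * math.log(p / q)

def kl(p, q):
    """D(p||q) as a single sum over the whole (extended) domain."""
    return sum(term(px, qx) for px, qx in zip(p, q))

print(kl((0.5, 0.5, 0.0), (0.25, 0.75, 0.0)))  # unaffected by the trailing zeros
```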