Neural Networks - Data Processing Inequality Issue


The data processing inequality states that if you have a Markov chain of random variables $X \rightarrow Y \rightarrow Z$, then $I(X;Y) \geq I(X;Z)$.

This all makes sense in the discrete case, but in the continuous case, which seems to be where it is actually used (e.g., for neural networks in https://arxiv.org/abs/1703.00810), there appears to be a counterexample:

If I pick $X \sim \mathrm{Unif}(0,0.5)$, $Y=X$, and $Z=c$ where $c$ is some constant,

then $I(X;Y)=I(X;X)=H(X)=-\log(2)$, and $I(X;Z)=0$ since $X$ and $Z$ are independent.

But $-\log(2) \ngeq 0$. So the data processing inequality is wrong?

Is there any way to resolve this issue?
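The number in the question can be checked numerically. A minimal sketch (not part of the original post), using the identity $h(X) = \mathbb{E}[-\log f(X)]$ for the differential entropy of a density $f$:

```python
import numpy as np

# X ~ Unif(0, 0.5) has density f(x) = 1/(0.5 - 0) = 2 on [0, 0.5].
# Its differential entropy is h(X) = E[-log f(X)] = -log 2 ≈ -0.6931.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 0.5, size=100_000)
f = np.full_like(x, 2.0)      # density evaluated at the samples (constant here)
h = -np.mean(np.log(f))       # Monte Carlo estimate of E[-log f(X)]

print(h, -np.log(2))  # both ≈ -0.6931
```

Note that the value is negative, which already hints that a differential entropy cannot be an "amount of information" in the Shannon sense.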


The line

$$I(X;Y)=I(X;X)=H(X)=-\log(2)$$

is wrong. Which equality is false depends on what you mean by $H(X)$.

If you mean the differential entropy (let's better write $h(X)$ in that case), then the equality $I(X;X)=h(X)$ is false. It's indeed true that $I(X;X)=h(X)-h(X\mid X)$, but $h(X\mid X)$ (which is the differential entropy of a constant, i.e., a Dirac delta density) is not zero but minus infinity. (If you are not convinced of this, compute the differential entropy of a uniform on $[0,a]$, and let $a\to 0$.)
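The suggested limit is easy to check, since $h(\mathrm{Unif}(0,a)) = \log a$ in closed form. A small sketch:

```python
import numpy as np

# Differential entropy of Unif(0, a) is h = log(a). As a -> 0 the density
# collapses toward a Dirac spike and h diverges to minus infinity.
widths = [1.0, 0.1, 0.01, 1e-6, 1e-12]
entropies = [np.log(a) for a in widths]
print(entropies)  # strictly decreasing, heading to -inf
```

So $h(X\mid X)$, the entropy of a zero-width spike, is $-\infty$, not $0$.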

If you mean the true entropy (Shannon entropy), then you can indeed write $I(X;X)=H(X)$, but now $H(X) =+\infty$, because a continuous variable (with support over an interval of positive length) carries an infinite amount of information.
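One way to see that $H(X)=+\infty$ here: quantize $X \sim \mathrm{Unif}(0,0.5)$ into $n$ equal bins and watch the Shannon entropy of the quantized variable grow like $\log n$, without bound. A sketch with an assumed sample size and bin counts:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 0.5, size=200_000)  # samples of X ~ Unif(0, 0.5)

def quantized_entropy_nats(samples, n_bins):
    # Empirical Shannon entropy (in nats) of the binned variable.
    counts, _ = np.histogram(samples, bins=n_bins, range=(0.0, 0.5))
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

for n in [4, 16, 64, 256]:
    print(n, quantized_entropy_nats(x, n))  # grows like log(n)
```

Refining the quantization never exhausts the information in $X$, which is exactly why $H(X)=+\infty$ for a continuous variable.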

On either account, $I(X;Y) = +\infty$.

The moral is: don't believe that the differential entropy is a (Shannon) entropy.