Shannon/Rajski information distance calculation


I recently came across a paper by Rajski [Information and Control (1961) 371–377] in which he defines an information distance $$ d(X,Y) = 1 - \frac{H(X)+H(Y)-H(X,Y)}{H(X,Y)} $$ in which $H(\cdot)$ is the usual Shannon entropy $$ H(X) = \sum_{i=1}^{N} x_{i} \log_{2}{(1\big/x_{i})} $$ of a discrete probability distribution $X = \{x_{1},x_{2},\ldots,x_{N}\}$ (which can be a multiset) and $H(\cdot,\cdot)$ is the joint entropy $$ H(X,Y) = \sum_{i=1}^{N_{X}}\sum_{j=1}^{N_{Y}}x_{i}y_{j} \log_{2}{\frac{1}{x_{i}y_{j}}} .$$ Rajski proves that this distance defines a metric space of discrete probability distributions.

Now, I have tried to check the distance properties with a very simple discrete probability distribution $X = \{0.6,0.4\}$, starting with the simplest requirement for a distance, namely that $d(X,X) = 0$. And I fail to get zero! Instead, my result equals one. What am I doing wrong? Where is the error in my thinking or my calculation?
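For concreteness, here is a minimal sketch of the calculation I am doing. Note that it computes the joint entropy from the products $x_i y_j$, exactly as in the formula above (the helper names `entropy`, `H_X`, `H_XY` are my own):

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a discrete distribution, skipping zero entries."""
    return sum(x * math.log2(1 / x) for x in p if x > 0)

# The distribution from the question.
X = [0.6, 0.4]

H_X = entropy(X)

# Joint entropy computed from the products x_i * y_j with Y = X,
# following the joint-entropy formula as I have written it above.
H_XY = entropy([xi * xj for xi in X for xj in X])

d = 1 - (H_X + H_X - H_XY) / H_XY
print(d)  # ≈ 1, not 0
```

The product construction forces $H(X,X) = 2H(X)$, so the mutual-information numerator $H(X)+H(X)-H(X,X)$ vanishes and the distance comes out as $1$.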

Rajski mentions that $d(X,X)=0$ "is equivalent to stating that a matrix of joint probabilities is quasi-diagonal", meaning that "such a matrix contains no more than one nonzero element in each row and in each column". I understand that in such a case the formula would indeed give the expected result. But how can such a matrix arise in the first place? A product can only be zero if one of its factors is zero, yet a discrete probability distribution need not contain any zero entries, and in any case zero entries should not matter for the entropy calculation, right? So how can $d(X,X)=0$ hold for an arbitrary discrete probability distribution $X$ if no further conditions are imposed?
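To check my reading of the quasi-diagonal case, here is a sketch in which the joint probabilities are taken from a diagonal matrix rather than from products $x_i y_j$ (the matrix `joint` is my own construction, not something given in the paper):

```python
import math

def entropy(p):
    """Shannon entropy (bits), skipping zero entries (0 * log(1/0) := 0)."""
    return sum(x * math.log2(1 / x) for x in p if x > 0)

X = [0.6, 0.4]

# A quasi-diagonal joint matrix: all probability mass on the diagonal,
# i.e. p(i, j) = x_i when i == j and 0 otherwise.
joint = [[0.6, 0.0],
         [0.0, 0.4]]

H_X = entropy(X)
H_XY = entropy([p for row in joint for p in row])  # equals H(X) here

d = 1 - (H_X + H_X - H_XY) / H_XY
print(d)  # 0.0
```

With this joint matrix, $H(X,Y) = H(X)$, so the distance does come out as zero, which is consistent with Rajski's remark; my question is why the joint matrix should be quasi-diagonal rather than the product matrix.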