minimizing mutual information vs. maximizing negentropy


I'm reading about independent component analysis (ICA), where the goal is to find an unmixing matrix that maximizes the independence of latent, linearly mixed sources.

The paper I'm reading (Hyvärinen) says that mutual information is a natural criterion for this because it is the information-theoretic measure of the independence of random variables. It goes on to say that mutual information can be expressed using the concept of negentropy, so that maximizing negentropy is roughly equivalent to minimizing mutual information. This is where I'm getting lost.

Terms in paper:

$$\textbf{differential entropy: } H(\mathbf y) = - \int f(\mathbf y) \log f(\mathbf y)\, d\mathbf y$$
$$\text{with random vector } \mathbf y = (y_1, \dots, y_n)^T \text{ and density } f(\cdot)$$
$$\textbf{negentropy: } J(\mathbf y) = H(\mathbf y_{\text{gauss}}) - H(\mathbf y)$$
$$\text{Property: negentropy is invariant under invertible linear transformations.}$$
$$\text{Mutual information can be expressed in terms of negentropy:}$$
$$\textbf{mutual information: } I(\mathbf y) = J(\mathbf y) - \sum_i J(y_i) \quad \text{(5)}$$
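For concreteness, here is a small sketch of the intuition that negentropy measures non-Gaussianity: it is zero for a Gaussian variable and positive otherwise. This uses the classic moment-based approximation $J(y) \approx \frac{1}{12}E\{y^3\}^2 + \frac{1}{48}\,\mathrm{kurt}(y)^2$ (an approximation, not the exact definition above):

```python
import numpy as np

def negentropy_approx(y):
    """Moment-based negentropy approximation for a standardized sample.

    J(y) ~ (1/12) E[y^3]^2 + (1/48) kurt(y)^2, where kurt is excess kurtosis.
    This is an approximation of J, not the exact integral definition.
    """
    y = (y - y.mean()) / y.std()          # standardize to zero mean, unit variance
    skew_term = np.mean(y**3) ** 2 / 12   # contribution from asymmetry
    kurt = np.mean(y**4) - 3              # excess kurtosis (0 for a Gaussian)
    return skew_term + kurt**2 / 48

rng = np.random.default_rng(0)
n = 100_000
gauss = rng.standard_normal(n)     # Gaussian sample: negentropy should be ~0
laplace = rng.laplace(size=n)      # heavy-tailed sample: negentropy clearly > 0

print(negentropy_approx(gauss))    # close to 0
print(negentropy_approx(laplace))  # clearly positive
```

So "maximizing negentropy" of each estimated component amounts to pushing each one as far from Gaussian as possible.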

It goes on to say:

"Because negentropy is invariant for invertible linear transformations, it is now obvious from (5) that finding an invertible transformation $\bf W$ that minimizes the mutual information is roughly equivalent to finding directions in which the negentropy is maximized."

Perhaps this is because I have not studied information theory, but this is not obvious to me. Could someone give me an intuitive explanation of why this property of negentropy leads to the statement above?