Stochastic Mutual Information Estimator


I am reading https://openreview.net/forum?id=ByxaUgrFvH and do not understand why the authors need a "complicated" derivation, because the result seems to follow immediately from the definition of entropy.

Problem

Let $\mathbf{x}$ be a random variable, $E_{\psi}: \mathcal{X} \rightarrow \mathcal{Z}$ be a deterministic model with parameters $\psi$, and $\mathbf{z}=E_{\psi}(\mathbf{x})$. We define $q_{\psi}(\mathbf{x},\mathbf{z})$ as the joint distribution induced by pushing $\mathbf{x}$ through $E_{\psi}$.

We wish to estimate the gradient of the mutual information w.r.t. the model parameters, i.e.

$$\nabla_{\psi}I_{\psi}(\mathbf{x},\mathbf{z})$$

Because of the identity $I_{\psi}(\mathbf{x},\mathbf{z})=H(\mathbf{x}) + H_{\psi}(\mathbf{z}) - H_{\psi}(\mathbf{x},\mathbf{z})$, with $H$ denoting the entropy, we know that

$$\nabla_{\psi}I_{\psi}(\mathbf{x},\mathbf{z}) = \nabla_{\psi}H_{\psi}(\mathbf{z}) - \nabla_{\psi}H_{\psi}(\mathbf{x},\mathbf{z})$$
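As a quick sanity check on this entropy decomposition (not from the paper, just an illustration with a made-up discrete joint), one can verify numerically that computing $I$ directly agrees with $H(\mathbf{x}) + H(\mathbf{z}) - H(\mathbf{x},\mathbf{z})$:

```python
import numpy as np

# A hypothetical 2x2 joint distribution p(x, z), chosen for illustration only.
p_xz = np.array([[0.3, 0.2],
                 [0.1, 0.4]])
p_x = p_xz.sum(axis=1)  # marginal over z
p_z = p_xz.sum(axis=0)  # marginal over x

def entropy(p):
    p = p.ravel()
    return -np.sum(p * np.log(p))

# Mutual information directly: I = sum_{x,z} p(x,z) log( p(x,z) / (p(x) p(z)) )
I_direct = np.sum(p_xz * np.log(p_xz / np.outer(p_x, p_z)))

# Via the entropy decomposition: I = H(x) + H(z) - H(x, z)
I_entropies = entropy(p_x) + entropy(p_z) - entropy(p_xz)

assert np.isclose(I_direct, I_entropies)
```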

The authors state that

$$\nabla_{\psi} H(\mathbf{z})=-\nabla_{\psi} \mathbb{E}_{q_{\psi}(\mathbf{z})}[\log q(\mathbf{z})], \quad \nabla_{\psi} H(\mathbf{x}, \mathbf{z})=-\nabla_{\psi} \mathbb{E}_{q_{\psi}(\mathbf{x}, \mathbf{z})}[\log q(\mathbf{x}, \mathbf{z})]$$

Their derivation

They derive this in their Appendix A as follows:

[Image: derivation from Appendix A of the paper]

My intuition

I do not see why this is necessary, because it seems that this follows directly from the definition of entropy:

$$\nabla_{\psi} H(\mathbf{z})=-\nabla_{\psi} \int_{\mathcal{Z}}q_{\psi}(\mathbf{z})\log q_{\psi}(\mathbf{z})d\mathbf{z}=-\nabla_{\psi} \mathbb{E}_{q_{\psi}(\mathbf{z})}[\log q_{\psi}(\mathbf{z})]$$

Question

Where am I going wrong? Why is this derivation with the chain rule required?

Accepted answer

Notice that in the quoted equation the notation is very particular: we have $$ -\nabla_\psi \mathbb{E}_{q_\psi} [ \log \mathbf{q}], $$ where I've added the bold for emphasis. This means that the only $q_\psi$ being differentiated is the one with respect to which the expectation is taken, not the one inside the $\log$. In my opinion this is absolutely terrible notation and prone to causing confusion. I haven't read the paper, but I hope it's justified by some other convenience (because otherwise it's just bad writing).

As for the point, notice that $\psi$ appears in two places: in the law $q_\psi$ with respect to which the expectation is taken, and in the $q_\psi$ that appears inside the $\log$. Assuming Leibniz's rule applies (which needs some regularity conditions), and taking the $\log$ to be natural without loss of generality, we have $$ \nabla_\psi H(q_\psi) = -\int (\nabla_\psi q_\psi) \log q_\psi - \int \frac{q_\psi}{q_\psi} \nabla_\psi q_\psi = -\int (\nabla_\psi q_\psi) \log q_\psi - \nabla_\psi \int q_\psi = -\int (\nabla_\psi q_\psi) \log q_\psi,$$ where we have used that $\int q_\psi = 1$, since it is a probability distribution. Now, the authors choose to denote this final expression as $\nabla_\psi \mathbb{E}_{q_\psi}[ \log q]$, where the understanding is that the derivative is carried out with the $\log q$ inside the expectation held fixed, and at the end we plug in $q = q_\psi.$
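A numeric sanity check of the surviving term (my own example, not from the paper): writing it as $-\mathbb{E}_{q_\psi}[(\nabla_\psi \log q_\psi)\log q_\psi]$ via $\nabla_\psi q_\psi = q_\psi \nabla_\psi \log q_\psi$, one can compare a Monte Carlo estimate against the known entropy gradient of a 1-D Gaussian $q_\psi = \mathcal{N}(0, \psi^2)$, for which $H = \tfrac{1}{2}\log(2\pi e \psi^2)$ and hence $\nabla_\psi H = 1/\psi$:

```python
import numpy as np

# Check that -E_{q_psi}[ (grad_psi log q_psi) * log q_psi ] matches the
# analytic entropy gradient 1/psi for the Gaussian q_psi = N(0, psi^2).
rng = np.random.default_rng(0)
psi = 1.3
x = rng.normal(0.0, psi, size=1_000_000)

log_q = -0.5 * np.log(2 * np.pi * psi**2) - x**2 / (2 * psi**2)
score = -1.0 / psi + x**2 / psi**3   # grad_psi log q_psi(x)
grad_mc = -np.mean(score * log_q)    # Monte Carlo estimate of the surviving term
grad_true = 1.0 / psi

print(grad_mc, grad_true)  # the two should agree to a couple of decimal places
```

Note that only the first term of the Leibniz computation survives; the Monte Carlo estimate above contains no analogue of the dropped $\nabla_\psi \int q_\psi$ term, which is exactly the point.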

To say this more clearly, one might define a function for a parameter $\psi$ and a distribution $r$ such as $$ D(\psi,r) = \left.\nabla_{\psi'} \mathbb{E}_{q_{\psi'}}[-\log r]\right|_{\psi' = \psi}.$$ Then one could say that $\nabla_\psi H(q_\psi) = D(\psi, q_\psi).$ Depending on the setting, it might be more convenient to just use the integral expression above.
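To make $D(\psi, r)$ concrete (my own sketch, using the same Gaussian family as above rather than anything from the paper): with $q_\psi = \mathcal{N}(0, \psi^2)$ one can reparameterize $x = \psi\varepsilon$, $\varepsilon \sim \mathcal{N}(0,1)$, hold the density $r = q_s$ fixed, and check by finite differences that $D(s, q_s) = \nabla_\psi H(q_\psi)\big|_{\psi=s} = 1/s$:

```python
import numpy as np

# D(psi, r) differentiates only the sampling law q_psi, holding r fixed.
# Here q_psi = N(0, psi^2); we evaluate D at psi = s with r = q_s.
rng = np.random.default_rng(1)
eps = rng.normal(size=1_000_000)  # fixed noise, shared across evaluations
s = 0.8                           # the point at which we evaluate

def neg_log_r(x, s):
    # -log of the *fixed* density r = N(0, s^2); s is NOT differentiated
    return 0.5 * np.log(2 * np.pi * s**2) + x**2 / (2 * s**2)

def F(psi):
    # E_{q_psi}[-log r], estimated via the reparameterization x = psi * eps
    return np.mean(neg_log_r(psi * eps, s))

h = 1e-4
D = (F(s + h) - F(s - h)) / (2 * h)  # finite-difference gradient at psi = s
grad_true = 1.0 / s

print(D, grad_true)  # should agree closely
```

Because $r$ is held fixed inside `neg_log_r`, only the sampling distribution is differentiated, which is exactly the "partial derivative first, then plug in $q = q_\psi$" reading of the authors' notation.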