Optimization of Variational Distribution covariance parameters in the log space


I am trying to follow the implementation of a Bayesian Gaussian Process Latent Variable Model [1]. In particular, I have problems understanding the gradient computation of the variational lower bound given in eq. (8).

The second term in the equation is the KL divergence $KL(q||p)$ between the variational posterior distribution $q(X)$ and the prior distribution $p(X)$ over the latent variables. Assume the prior $p(X)$ has zero mean and unit variance, and that the variational distribution $q(X)$ is parametrized by mean values $\{\mu_1, \dots, \mu_N\}$ and a diagonal covariance matrix with entries $\{\sigma_1, \dots, \sigma_N\}$. Taking $N=2$ for simplicity, the KL divergence should be given as:

$KL(q||p) = \frac{1}{2}(-2-\log(\sigma_1 \sigma_2) + \mu_1^2 + \mu_2^2 + \sigma_1 + \sigma_2)$
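For reference, this is the $N=2$ case of the standard closed-form KL divergence between a diagonal Gaussian and a standard normal (assuming each $\sigma_i$ denotes a variance, not a standard deviation):

$$KL(q||p) = \frac{1}{2}\sum_{i=1}^{N}\left(\sigma_i + \mu_i^2 - 1 - \log \sigma_i\right)$$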

Consequently, the derivative w.r.t. $\sigma_1$ should be:

$\frac{\partial KL(q||p)}{\partial \sigma_1} = \frac{1}{2}-\frac{1}{2\sigma_1}$

However, in the implementation of the authors, the derivative is computed as:

$\frac{\partial KL(q||p)}{\partial \sigma_1} = \frac{1}{2} -\frac{1}{2} \cdot \sigma_1$

With the following comment in the code:

the covars are optimized in the log space (otherwise the $\cdot$ becomes a / and the signs change)

Clearly, replacing the multiplication with a division gives the derivative as I would expect. Intuitively, I would guess that "optimizing the covars in the log space" is a common trick to enforce positivity or improve numerical stability. However, it is not clear to me how one can derive the expression for the gradient that is used.

In the code, they also apply an elementwise exponential on $\{\sigma_1, \dots,\sigma_N \}$ before computing the gradient, but I still don't see how this changes the gradient as given.
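Since the question is exactly how the log-space parametrization alters the gradient, a small finite-difference check may help make the chain rule concrete. This is a hedged sketch, not the authors' code: it assumes each $\sigma$ is a variance and that the quantity actually being optimized is $\phi = \log \sigma$, so that $\partial KL/\partial \phi = \sigma \cdot \partial KL/\partial \sigma$.

```python
import math

# Per-dimension contribution to KL(q||p) for p = N(0, 1) and
# q = N(mu, sigma), with sigma treated as a variance.
def kl_term(sigma, mu=0.0):
    return 0.5 * (sigma + mu**2 - 1.0 - math.log(sigma))

sigma = 0.7
eps = 1e-6

# Gradient w.r.t. sigma itself: analytic form 1/2 - 1/(2*sigma).
fd_sigma = (kl_term(sigma + eps) - kl_term(sigma - eps)) / (2 * eps)
analytic_sigma = 0.5 - 1.0 / (2.0 * sigma)

# Gradient w.r.t. phi = log(sigma): the chain rule multiplies by
# d sigma / d phi = sigma, so the division by sigma becomes a product.
phi = math.log(sigma)
fd_phi = (kl_term(math.exp(phi + eps)) - kl_term(math.exp(phi - eps))) / (2 * eps)
analytic_phi = (0.5 - 1.0 / (2.0 * sigma)) * sigma  # = sigma/2 - 1/2

print(fd_sigma, analytic_sigma)
print(fd_phi, analytic_phi)
```

In this sketch the log-space gradient is simply the $\sigma$-space gradient multiplied by $\sigma$, which is how the $1/(2\sigma)$ term can turn into a constant and a plain multiple of $\sigma$ can appear instead, as the asker observes in the authors' expression.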

[1] http://proceedings.mlr.press/v9/titsias10a/titsias10a.pdf