Variational inference: Does the natural gradient follow geodesics locally?


Amari's natural gradient descent is a well-known optimisation algorithm from information geometry, well suited to finding optima of objective functions defined on statistical manifolds. It consists of preconditioning the gradient descent update rule with the inverse of the Fisher information metric tensor.
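As a concrete sketch of the update rule (a toy one-parameter Bernoulli model; the helper name is my own, not from any particular library), the ordinary gradient is preconditioned by the inverse Fisher matrix:

```python
import numpy as np

def natural_gradient_step(theta, grad, fisher, lr):
    """One natural-gradient update: theta <- theta - lr * F(theta)^{-1} grad(theta).
    Solve the linear system rather than forming the inverse explicitly."""
    return theta - lr * np.linalg.solve(fisher(theta), grad(theta))

# Toy example: fit a Bernoulli parameter p to an empirical mean m by
# minimising the average negative log-likelihood.
m = 0.7
grad = lambda p: np.array([(p[0] - m) / (p[0] * (1 - p[0]))])   # d/dp of the NLL
fisher = lambda p: np.array([[1.0 / (p[0] * (1 - p[0]))]])      # Fisher info of Bernoulli(p)

p = np.array([0.2])
for _ in range(100):
    p = natural_gradient_step(p, grad, fisher, lr=0.5)
# The natural gradient here is simply (p - m): the 1/(p(1-p)) curvature of the
# parametrisation cancels, and p converges to m.
```

Note how the Fisher preconditioning makes the effective step independent of the parametrisation's curvature, which is the usual motivation for the method.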

The objective function I am concerned with is the variational free energy, or evidence lower bound, in the context of approximate Bayesian inference (i.e. variational Bayes). The metric on the statistical manifold is the Fisher information metric. The particular problem I am working on is categorical inference: the statistical manifold is the standard simplex, and under a suitable change of coordinates (a diffeomorphism onto a quadrant of the sphere) the Fisher metric becomes proportional to the round Euclidean metric. This might be a special case where my question holds true, but I would also like to know the answer in general.

My intuition for the natural gradient is that it follows (locally) the direction of greatest change of the objective function in information space, just as ordinary gradient descent follows the direction of greatest change in Euclidean space. This leads me to ask whether, in the limit where the step size tends to zero, natural gradient descent follows geodesics locally on the manifold, according to the Fisher information metric.

If this is the case, could you explain how? If not, could you explain why not to help me improve my understanding?

There are 2 answers below.

BEST ANSWER

In order for gradient flow lines on a Riemannian manifold $(M,g)$ to follow geodesics, the gradient field $\text{grad}\,\varphi$ has to be proportional to a vector field which is constant along itself, i.e. there must be a positive function $\sigma$ such that $X=\sigma\cdot\text{grad}\,\varphi$ satisfies $\nabla_X X=0$. The functions $\varphi$ for which this is true are very rare.
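To see where this condition comes from, write $G = \text{grad}\,\varphi$ and expand $\nabla_X X$ using the Leibniz rule for the connection:

$$\nabla_X X = \sigma\,\nabla_G(\sigma G) = \sigma\,G(\sigma)\,G + \sigma^2\,\nabla_G G,$$

so $\nabla_X X = 0$ forces $\nabla_G G = -\frac{G(\sigma)}{\sigma}\,G$, i.e. the covariant acceleration of $\text{grad}\,\varphi$ must be everywhere proportional to $\text{grad}\,\varphi$ itself. This is exactly the condition for the integral curves to be pregeodesics, and it fails for a generic $\varphi$.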

Regarding the specific question about the "natural gradient" flow: on a statistical manifold, the geodesics with respect to the exponential and mixture connections (the dually flat connections of information geometry) are more fundamental than those of the Levi-Civita connection. It turns out that the gradient flow of the KL divergence (respectively, the dual KL divergence) is a time-changed exponential (respectively, mixture) geodesic, and this comes down to the fundamental importance of these divergences and their relation with the corresponding connections.
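For a categorical family, the two geodesic families mentioned here have simple closed forms (a sketch assuming NumPy; the function names are mine): the m-geodesic interpolates linearly in the probabilities, while the e-geodesic interpolates linearly in the log-probabilities and renormalises.

```python
import numpy as np

def m_geodesic(p, q, t):
    """Mixture (m-)geodesic: linear interpolation of the probability vectors."""
    return (1 - t) * p + t * q

def e_geodesic(p, q, t):
    """Exponential (e-)geodesic: linear interpolation of the log-probabilities,
    renormalised back onto the simplex."""
    r = p ** (1 - t) * q ** t
    return r / r.sum()

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
# Both curves share endpoints but trace different paths through the simplex,
# and neither coincides with the Levi-Civita (Fisher-Rao) geodesic in general.
```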

For further intuition on steepest descent and gradients vs. differentials in general, please also take a look at this related CrossValidated answer.

SECOND ANSWER

No, in general this is not true, although it does hold for a surface of revolution (symmetric about the $z$-axis). Indeed, one can design surfaces on which curves of steepest ascent spiral for an arbitrarily long distance before reaching the top of a mountain.

See Exercise 28 on p. 78 of my differential geometry text or the article "When Does Water Find the Shortest Path Downhill? The Geometry of Steepest Descent Curves," in The American Mathematical Monthly, December, 2003.