Amari's natural gradient descent is a well-known optimisation algorithm from information geometry, well suited to finding optima of functionals on statistical manifolds. It preconditions the ordinary gradient descent update with the inverse of the Fisher information metric tensor.
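Concretely, writing $\theta$ for the parameters, $\mathcal{L}$ for the objective and $F(\theta)$ for the Fisher information matrix, the update rule is

$$\theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1}\, \nabla_\theta \mathcal{L}(\theta_t), \qquad F_{ij}(\theta) = \mathbb{E}_{p_\theta}\!\left[\partial_i \log p_\theta \;\partial_j \log p_\theta\right],$$

which reduces to ordinary gradient descent when $F$ is the identity.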
The objective function I am concerned with is the variational free energy (the negative of the evidence lower bound), in the context of approximate Bayesian inference (i.e. variational Bayes). The metric on the statistical manifold is the Fisher information metric. The particular problem I am working on is categorical inference: the statistical manifold is the standard simplex, and the Fisher metric is proportional to the Euclidean (round) metric after a suitable change of coordinates (a diffeomorphism onto the positive orthant of a sphere). This might be a special case where my question holds true, but I would also like to know the answer in general.
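To make the sphere picture concrete, here is a small numerical check (my own sketch; the two specific distributions are arbitrary). Under $p \mapsto \sqrt{p}$ the simplex maps onto the positive orthant of the unit sphere and the Fisher metric becomes four times the round metric, so the Fisher–Rao distance between categoricals $p, r$ is $2\arccos\sum_i \sqrt{p_i r_i}$. The script compares this closed form against a direct line integral of $ds^2 = \sum_i dp_i^2 / p_i$ along the spherical geodesic pushed back to the simplex.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
r = np.array([0.1, 0.6, 0.3])

# Square-root map onto the unit sphere; Bhattacharyya-angle closed form
u, v = np.sqrt(p), np.sqrt(r)
ang = np.arccos(np.clip(u @ v, -1.0, 1.0))
closed_form = 2.0 * ang

# Great-circle curve between u and v, pushed back to the simplex
t = np.linspace(0.0, 1.0, 20001)
c = (np.sin((1 - t)[:, None] * ang) * u + np.sin(t[:, None] * ang) * v) / np.sin(ang)
path = c ** 2                               # points on the simplex

# Numerical Fisher length: ds^2 = sum_i dp_i^2 / p_i, midpoint rule
dp = np.diff(path, axis=0)
mid = 0.5 * (path[1:] + path[:-1])
length = np.sum(np.sqrt(np.sum(dp ** 2 / mid, axis=1)))

print(closed_form, length)                  # the two agree closely
```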
My intuition for the natural gradient is that it locally follows the direction of greatest change of the objective function in information space, just as ordinary gradient descent follows the direction of greatest change in Euclidean space. This leads me to ask whether, in the limit where the step size tends to zero, natural gradient descent locally follows geodesics of the Fisher information metric on the manifold.
If this is the case, could you explain how? If not, could you explain why not to help me improve my understanding?
In order for gradient flow lines on a Riemannian manifold $(M,g)$ to follow geodesics, the gradient field $\text{grad}\,\varphi$ has to be proportional to a vector field which is parallel along itself, i.e. there must be a positive function $\sigma$ such that $X=\sigma\cdot\text{grad}\,\varphi$ satisfies $\nabla_X X=0$. The functions $\varphi$ for which this is true are very rare.
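A standard family of examples where this does happen (my addition, for illustration): distance functions. If $\varphi$ satisfies the eikonal equation $|\text{grad}\,\varphi|_g\equiv 1$, e.g. $\varphi = d_g(\cdot,p)$ away from the cut locus of $p$, then the symmetry of the Hessian gives

$$\nabla_{\text{grad}\,\varphi}\,\text{grad}\,\varphi = \tfrac{1}{2}\,\text{grad}\,|\text{grad}\,\varphi|_g^2 = 0,$$

so the gradient flow lines are unit-speed geodesics. A generic objective has no reason to satisfy anything like this.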
Regarding the specific question about the natural gradient flow: on a statistical manifold, the geodesics of the exponential and mixture connections (the dually flat connections of information geometry) are more fundamental than those of the Levi-Civita connection. It turns out that the gradient flow of the KL divergence is a time-changed exponential geodesic, and the gradient flow of the dual KL divergence a time-changed mixture geodesic. This comes down to the fundamental importance of these divergences and their relation to the corresponding connections.
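Here is a small numerical sketch of that statement for a $3$-class categorical, an exponential family with natural parameters $\theta$ (logits, last class pinned to $0$) and mean parameters $\eta$ (the first two probabilities). With the convention $\mathrm{KL}(p^*\,\|\,p_\theta)$ for the divergence being minimised (my choice for the demo; conventions differ), the natural-gradient flow satisfies $\dot\eta = -(\eta - \eta^*)$: a straight line in mean coordinates, i.e. a time-reparameterised mixture geodesic. The script checks this first-order identity for one small step; the two parameter vectors are arbitrary.

```python
import numpy as np

def mean_params(theta):
    """Mean parameters (first two class probabilities) from logits theta."""
    z = np.exp(np.append(theta, 0.0))    # third logit pinned to 0
    return (z / z.sum())[:2]

theta_star = np.array([1.0, -0.5])       # target distribution (arbitrary)
eta_star = mean_params(theta_star)

theta = np.array([-2.0, 1.5])            # current iterate (arbitrary)
lr = 1e-5                                # small step to approximate the flow

e = mean_params(theta)
fisher = np.diag(e) - np.outer(e, e)     # Fisher information in theta-coords
grad = e - eta_star                      # Euclidean gradient of KL(p*||p_theta)
theta_new = theta - lr * np.linalg.solve(fisher, grad)  # natural-gradient step

delta_eta = mean_params(theta_new) - e
# Prediction from d(eta)/dt = -(eta - eta*):
print(np.allclose(delta_eta, -lr * grad, rtol=1e-2))    # True
```

Iterating many such steps traces (approximately) the straight segment in mean coordinates from the initial $\eta$ to $\eta^*$, run at an exponentially decaying speed.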
For further intuition on steepest descent and gradients vs. differentials in general, please also take a look at this related CrossValidated answer.