How to improve the numerical stability of the inverse rank-one Cholesky update?


I am trying to use the inverse Cholesky update from page 10 of the paper *Efficient covariance matrix update for variable metric evolution strategies* as part of the optimization step in a neural network, and I am struggling significantly because it is so unstable. There is nothing wrong with its logic, but I've found that it requires very low learning rates $\beta$, and even then it works quite poorly. The full reasons are unknown to me, but there is some indication that the expression as originally written is numerically unstable. I set $\alpha = 1 - \beta$.

$$ A^{-1}_{t+1} = \frac 1 {\sqrt \alpha} A^{-1}_t - \frac 1 {\sqrt \alpha \left\|z_t\right\|^2} \left(1 - \frac 1 {\sqrt {1 + \frac \beta \alpha \left\|z_t\right\|^2}} \right) z_t [z^T_tA^{-1}_t] $$
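To make the discussion concrete, here is a direct NumPy transcription of that expression. This is only a sketch: the function name and the convention $\alpha = 1 - \beta$ are my own assumptions, not taken from the paper.

```python
import numpy as np

def inv_chol_update_naive(A_inv, z, beta, dtype=np.float32):
    """Direct transcription of the inverse rank-one update as written (sketch)."""
    A_inv = np.asarray(A_inv, dtype=dtype)
    z = np.asarray(z, dtype=dtype)
    alpha = dtype(1.0 - beta)      # assumed convention: alpha = 1 - beta
    zz = z @ z                     # ||z_t||^2
    w = z @ A_inv                  # row vector z_t^T A_t^{-1}
    factor = 1.0 - 1.0 / np.sqrt(1.0 + (beta / alpha) * zz)
    return A_inv / np.sqrt(alpha) - (factor / (np.sqrt(alpha) * zz)) * np.outer(z, w)
```

In float64 this reproduces the paper's invariant (the output is the inverse of the updated Cholesky factor); the trouble described below only appears in float32.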

By distributing $\sqrt \alpha$ into the parenthesized factor, I think I've found the first place where this expression can be improved.

$$ A^{-1}_{t+1} = \frac 1 {\sqrt \alpha} A^{-1}_t - \frac 1 {\left\|z_t\right\|^2} \left(\frac 1 {\sqrt \alpha} - \frac 1 {\sqrt {\alpha + \beta \left\|z_t\right\|^2}} \right) z_t [z^T_tA^{-1}_t] $$
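The rewritten form transcribes the same way; only the scalar coefficient changes. As above, this is a sketch and the function name and the convention $\alpha = 1 - \beta$ are assumptions of mine.

```python
import numpy as np

def inv_chol_update_distributed(A_inv, z, beta, dtype=np.float32):
    """Same update with sqrt(alpha) distributed into the coefficient (sketch)."""
    A_inv = np.asarray(A_inv, dtype=dtype)
    z = np.asarray(z, dtype=dtype)
    alpha = dtype(1.0 - beta)      # assumed convention: alpha = 1 - beta
    zz = z @ z                     # ||z_t||^2
    w = z @ A_inv                  # row vector z_t^T A_t^{-1}
    # 1/sqrt(alpha) - 1/sqrt(alpha + beta*||z||^2): algebraically identical
    # to the original, but never forms the intermediate 1 + (beta/alpha)*||z||^2.
    coeff = (1.0 / np.sqrt(alpha) - 1.0 / np.sqrt(alpha + beta * zz)) / zz
    return A_inv / np.sqrt(alpha) - coeff * np.outer(z, w)
```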

I've yet to test this, but I have reason to expect it will behave better. While testing the back-whitening in the last layer, I ran into a situation where the inverse Cholesky factor was not updating at all. Looking into it, the squared L2 norm $\left\|z_t\right\|^2$ was around $10^{-3}$, while the learning rate $\beta$ was very low, around $10^{-5}$, because higher values diverged. As a result, $\sqrt {1 + \frac \beta \alpha \left\|z_t\right\|^2}$ always evaluated to exactly one, so the factor $1 - \frac 1 {\sqrt {1 + \frac \beta \alpha \left\|z_t\right\|^2}}$ was exactly zero and no updates ever took place, because $1 + 10^{-8} = 1$ with float32 numbers.
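That rounding behavior is easy to reproduce: float32 has a machine epsilon of about $1.2 \times 10^{-7}$, so an increment of $10^{-8}$ added to $1$ is lost entirely.

```python
import numpy as np

# Machine epsilon of float32 is 2**-23 ~ 1.19e-7, so 1 + 1e-8 rounds back to 1:
bumped = np.float32(1.0) + np.float32(1e-8)
print(bumped == np.float32(1.0))     # prints True
print(np.finfo(np.float32).eps)      # prints 1.1920929e-07
```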

Distributing the $\sqrt \alpha$ definitely feels right here, but I am hardly an expert in numerical optimization and am just going off my intuition as a programmer.

Are there any more moves I could take here to make the expression behave better?