I am reading Theorem 2 of *Natural Gradient Works Efficiently in Learning* by Amari. It concerns the Fisher efficiency of natural gradient descent (NGD).
Let $D_T = \{ (x_1, y_1), \ldots, (x_T, y_T) \}$ be $T$ independent input-output examples generated by the neural network with parameter $w^*$, and define the log loss to be minimized as $l(x, y; w) = -\log p(x, y; w)$.
The NGD update rule is $\tilde{w}_{t+1} = \tilde{w}_{t} - \frac{1}{t} \tilde{\nabla} l(x_t, y_t; \tilde{w}_t)$. Here $\tilde{\nabla} l$ denotes the steepest-descent direction on the Riemannian manifold, that is, $\tilde{\nabla} l(x_t, y_t; \tilde{w}_t) = G^{-1} \nabla l(x_t, y_t; \tilde{w}_t)$, where $G$ is the Fisher information matrix $G(w) = E\left[ \frac{\partial l(x,y;w)}{\partial w} \left( \frac{\partial l(x,y;w)}{\partial w} \right)^T \right]$.
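To build intuition, I also ran a small numerical check (this toy setup is my own, not from the paper): a one-dimensional Gaussian model $y \sim N(w^*, \sigma^2)$ with log loss $l(y; w) = (y - w)^2 / (2\sigma^2)$, where the Fisher information is $G = 1/\sigma^2$, so the natural gradient is simply $G^{-1} \, \partial l / \partial w = w - y$. The final variance of the NGD estimator does come out close to $G^{-1}/T$:

```python
import numpy as np

# Toy check (my own setup, not from the paper): 1-D Gaussian model
# y ~ N(w*, sigma^2) with log loss l(y; w) = (y - w)^2 / (2 sigma^2).
# The Fisher information is G = 1/sigma^2, so the natural gradient is
# G^{-1} * dl/dw = sigma^2 * (w - y)/sigma^2 = w - y.
rng = np.random.default_rng(0)
w_star, sigma, T, runs = 2.0, 0.5, 1000, 2000

w = np.zeros(runs)                    # 'runs' independent NGD trajectories
for t in range(1, T + 1):
    y = w_star + sigma * rng.standard_normal(runs)
    w -= (1.0 / t) * (w - y)          # NGD step with learning rate 1/t

emp_var = np.mean((w - w_star) ** 2)  # empirical E[(w_T - w*)^2]
cramer_rao = sigma**2 / T             # G^{-1} / T, the Cramer-Rao bound
print(emp_var, cramer_rao)            # these nearly agree
```

So the asymptotic claim $\tilde{V}_T \approx \frac{1}{T} G^{-1}$ looks right empirically; my question is about the derivation itself.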
Let us denote the covariance matrix of the estimator $\tilde{w}_t$ by $\tilde{V}_t = E\left[ (\tilde{w}_t - w^*)(\tilde{w}_t - w^*)^T \right]$.
On the other hand, by the Taylor expansion around $w^*$, $\frac{\partial l(x_t, y_t; \tilde{w}_t)}{\partial w} = \frac{\partial l(x_t, y_t; w^*)}{\partial w} + \frac{\partial^2 l(x_t, y_t; w^*)}{\partial w \partial w^T} (\tilde{w}_t - w^*) + \mathcal{O}(|\tilde{w}_t - w^*|^2)$.
By subtracting $w^*$ from both sides of the update rule $\tilde{w}_{t+1} = \tilde{w}_{t} - \frac{1}{t} \tilde{\nabla} l(x_t, y_t; \tilde{w}_t)$ and taking the expectation of the square (outer product) of both sides, the paper says we get $\tilde{V}_{t+1} = \tilde{V}_{t} - \frac{2}{t} \tilde{V}_{t} + \frac{1}{t^2} G^{-1} + \mathcal{O}\left( \frac{1}{t^3} \right)$.
This step uses the following three relations: $E\left[ \frac{\partial l(x,y; w^*)}{\partial w} \right] = 0$, $\quad E\left[ \frac{\partial^2 l(x,y; w^*)}{\partial w \partial w^T} \right] = G(w^*)$, $\quad G(\tilde{w}_t) = G(w^*) + \mathcal{O}\left(\frac{1}{t}\right)$.
But I can't see how. How does one derive $\tilde{V}_{t+1} = \tilde{V}_{t} - \frac{2}{t} \tilde{V}_{t} + \frac{1}{t^2} G^{-1} + \mathcal{O}\left( \frac{1}{t^3} \right)$ from these?
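For reference, here is how far I get when I expand the outer product myself (assuming this is the intended route):

$$\tilde{w}_{t+1} - w^* = (\tilde{w}_t - w^*) - \frac{1}{t} G^{-1} \nabla l(x_t, y_t; \tilde{w}_t),$$

so

$$\tilde{V}_{t+1} = \tilde{V}_t - \frac{1}{t} E\left[ (\tilde{w}_t - w^*) \, \nabla l^T G^{-1} \right] - \frac{1}{t} E\left[ G^{-1} \nabla l \, (\tilde{w}_t - w^*)^T \right] + \frac{1}{t^2} E\left[ G^{-1} \nabla l \, \nabla l^T G^{-1} \right].$$

I don't see why the two cross terms together reduce to $-\frac{2}{t}\tilde{V}_t$ and why the last term reduces to $\frac{1}{t^2} G^{-1} + \mathcal{O}\left(\frac{1}{t^3}\right)$.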
Thank you