I was reading a tutorial on training a neural network using Newton's method, and it says, "The maximum error reduction (of the error surface function) depends on the ratio of the gradient to the curvature. So, a good direction to move in is one with a high ratio of gradient to curvature, even if the gradient itself is small"
Can anybody give an intuitive explanation of why a direction with a high ratio of gradient to curvature is a good one to move in?
I'm assuming by 'curvature' they mean the second derivative. The formula is pretty simple to derive. Suppose that $y$ is the actual minimum of $f$ and $x$ is the current point. Then the second-order Taylor approximation of $f$ about $x$, evaluated at $y$, gives
$$f(y) \approx f(x) + f'(x)(y-x) + \frac{1}{2}f''(x)(y-x)^2$$
If you differentiate both sides with respect to $y$ and use that $f'(y)=0$ (since $y$ is a minimum), then you get
$$y\approx x-\frac{f'(x)}{f''(x)}.$$
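Substituting this step back into the quadratic approximation shows how big a decrease it is predicted to achieve:
$$f(y) \approx f(x) - \frac{f'(x)^2}{2f''(x)} = f(x) - \frac{1}{2}\,f'(x)\cdot\frac{f'(x)}{f''(x)},$$
so the predicted error reduction is half the gradient times the gradient-to-curvature ratio. That is the ratio the tutorial is talking about: a direction can have a small gradient and still promise a large reduction if its curvature is even smaller.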
So the Newton step is the move that achieves the maximum decrease in the error according to a local quadratic approximation of the error surface. Intuitively, low curvature means the gradient dies off slowly, so along a direction with a high gradient-to-curvature ratio you can take a long step downhill before the surface bends back up, even if the gradient itself is small.
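To make the update concrete, here's a minimal sketch of the one-dimensional Newton iteration; the toy error function and starting point are arbitrary choices of mine, not from the tutorial:

```python
def newton_minimize(f_prime, f_double_prime, x, steps=20):
    """Repeatedly apply the Newton update x <- x - f'(x) / f''(x)."""
    for _ in range(steps):
        x -= f_prime(x) / f_double_prime(x)
    return x

# Toy error function f(x) = x^4 - 3x^2 + x and its derivatives (hand-picked).
f = lambda x: x**4 - 3 * x**2 + x
f_prime = lambda x: 4 * x**3 - 6 * x + 1
f_double_prime = lambda x: 12 * x**2 - 6

x_min = newton_minimize(f_prime, f_double_prime, x=1.0)
print(x_min, f(x_min))  # lands on a local minimum, where f'(x) is ~0
```

For a neural network the same idea applies with the gradient vector and the Hessian matrix in place of $f'$ and $f''$.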