My machine learning textbook states the following when discussing second-order Taylor series approximations in the context of Gradient descent:
The (directional) second derivative tells us how well we can expect a gradient descent step to perform. We can make a second-order Taylor series approximation to the function $f(\mathbf{x})$ around the current point $\mathbf{x}^{(0)}$:
$$f(\mathbf{x}) \approx f(\mathbf{x}^{(0)}) + (\mathbf{x} - \mathbf{x}^{(0)})^T \mathbf{g} + \dfrac{1}{2} (\mathbf{x} - \mathbf{x}^{(0)})^T\mathbf{H}(\mathbf{x} - \mathbf{x}^{(0)}),$$
where $\mathbf{g}$ is the gradient and $\mathbf{H}$ is the Hessian at $\mathbf{x}^{(0)}$. If we use a learning rate of $\epsilon$, then the new point $\mathbf{x}$ will be given by $\mathbf{x}^{(0)} - \epsilon \mathbf{g}$. Substituting this into our approximation, we obtain
$$f(\mathbf{x}^{(0)} - \epsilon \mathbf{g}) \approx f(\mathbf{x}^{(0)}) - \epsilon \mathbf{g}^T \mathbf{g} + \dfrac{1}{2} \epsilon^2 \mathbf{g}^T \mathbf{H} \mathbf{g}.$$
There are three terms here: the original value of the function, the expected improvement due to the slope of the function, and the correction we must apply to account for the curvature of the function. When this last term is too large, the gradient descent step can actually move uphill.
Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron; Bach, Francis. Deep Learning (Page 85).
I have a few points of confusion with regard to this explanation:
Which term is the expected improvement due to the slope of the function, and which is the correction we must apply to account for the curvature of the function?
Why do we need a "correction" to account for the curvature of the function?
Can someone please elaborate, more specifically, on what is meant by "when this last term is too large, the gradient descent step can actually move uphill"?
I would greatly appreciate it if people could please take the time to clarify this.
EDIT:
With regards to my first question, I deduced from here that the term $- \epsilon \mathbf{g}^T \mathbf{g}$ is the improvement due to the slope of the function (the improvement in error afforded by the magnitude of the gradient), and the term $\dfrac{1}{2} \epsilon^2 \mathbf{g}^T \mathbf{H} \mathbf{g}$ is the correction to account for the curvature of the function (correction term that incorporates the curvature of the surface as represented by the Hessian matrix).
But, as far as I can tell, $- \epsilon \mathbf{g}^T \mathbf{g}$ isn't a magnitude, since, on an $n$-dimensional Euclidean space $\mathbb{R}^n$, the magnitude/norm/$2$-norm/$L^2$-norm of a vector $\mathbf{x} = (x_1, x_2, \dots, x_n)$ is $||\mathbf{x}||_2 = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2}$. So how is it the "improvement in error afforded by the magnitude of the gradient"?
Let's break Taylor's theorem down a bit. The best linear approximation to $f$ at any given point $x$ is given by the first-order Taylor series: $$ f(x + v) \approx f(x) + \nabla f(x)^T v, $$ where the error is $o(\|v\|)$. You can visualize this for $f : \mathbb R^2 \to \mathbb R$ by realizing that the graph of the linear approximation is the plane tangent to the graph of $f$ at $x$. This is true in higher dimensions, too; just replace "plane" with "hyperplane".
Gradient descent is basically what happens when you attempt to take the best possible step towards minimizing $f$, using only this linear approximation. Since the approximation is linear, it has no local extrema; in particular, it has no minimizer. Hence we can't actually locally minimize --- the best that we can do is go along its steepest downward slope. But we can't just go down this slope forever: there is no minimum of the linear approximation at which we would stop. Another way to think about this is that the linear approximation contains only directional information. The "learning rate" $\epsilon$ basically says "how close to $x$ do I actually trust my linear approximation --- how far am I willing to step along this gradient?"
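As a concrete sketch (the quadratic and the step size here are my own illustrative choices, not from the book), here is gradient descent using only this "direction plus learning rate" information:

```python
import numpy as np

# Illustrative convex quadratic: f(x) = 0.5 * x^T A x, with gradient A x.
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

x = np.array([1.0, 1.0])
eps = 0.1  # learning rate: how far we trust the linear approximation

for _ in range(50):
    x = x - eps * grad(x)  # step along the steepest downward slope

print(f(x))  # small: the iterates approach the minimizer at the origin
```

Each step only uses the gradient's direction; it is the fixed $\epsilon$ that decides how far to commit to it.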
Now on to the second-order term: Taylor's theorem says that the best quadratic approximation of $f$ at $x$ is given by the second-order Taylor series: $$ f(x + v) \approx f(x) + \nabla f(x)^Tv + \frac{1}{2} v^TH(x)v, $$ where the error is $o(\|v\|^2)$. You can picture the graph as something similar to a "tangent paraboloid" at $x$. Hence you can think of the second-order term as a "correction" factor to the linear approximation, adding in local curvature information on top of the $\nabla f(x)^Tv$ term. If you need convincing that the Hessian provides curvature information, recall that at a critical point $\det H(x)$ is the Gaussian curvature of the graph of $f$ at $x$ (this fact underlies what was probably called the second partial derivative test in your multivariable calculus class).
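To see the correction at work, here is a quick numerical comparison of the two approximations (the function and the step $v$ are my own illustrative choices):

```python
import numpy as np

# Illustrative smooth function: f(x) = exp(x1) + x2^2
def f(x):
    return np.exp(x[0]) + x[1] ** 2

x = np.array([0.0, 0.0])
g = np.array([np.exp(x[0]), 2 * x[1]])   # gradient at x
H = np.diag([np.exp(x[0]), 2.0])         # Hessian at x

v = np.array([0.1, 0.1])                 # a small step away from x

first_order = f(x) + g @ v                      # tangent (hyper)plane
second_order = first_order + 0.5 * v @ H @ v    # "tangent paraboloid"

err1 = abs(f(x + v) - first_order)
err2 = abs(f(x + v) - second_order)
print(err1, err2)  # the quadratic model is far more accurate
```

The curvature term shrinks the approximation error by roughly two orders of magnitude here, consistent with $o(\|v\|)$ versus $o(\|v\|^2)$.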
It should be clear from this discussion that, using only the linear approximation (gradient descent), we can't go very far at all and still have a good approximation of $f$ (unless $f$ is very close to linear), and in particular there's no reason why a sufficiently large step along the negative gradient couldn't increase $f$. Similarly, the author should note that the same thing can happen with the second-order approximation: step too far, and the approximation diverges from $f$ sufficiently that we have no idea whether $f$ will increase or decrease!
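Here is the "uphill" phenomenon made explicit on an ill-conditioned quadratic (my own example; for a quadratic the second-order expansion is exact, so the predicted change matches the actual change):

```python
import numpy as np

# Ill-conditioned quadratic: f(x) = 0.5 * x^T H x, constant Hessian H.
H = np.diag([100.0, 1.0])

def f(x):
    return 0.5 * x @ H @ x

x0 = np.array([1.0, 1.0])
g = H @ x0      # gradient at x0
eps = 0.05      # too large for the high-curvature direction

x1 = x0 - eps * g
# Second-order prediction of the change in f (exact here, f is quadratic):
predicted = -eps * g @ g + 0.5 * eps**2 * g @ H @ g
print(f(x1) - f(x0), predicted)  # both positive: the step moved uphill
```

The curvature term $\frac{1}{2}\epsilon^2 g^T H g$ outweighs the slope term $\epsilon g^T g$, so the "descent" step increases $f$.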
Re: your question about $g^Tg$ as a magnitude, note that $\| g \|_2 = \sqrt{g^Tg}$ --- this is often in fact taken as the definition of the Euclidean norm.
Edit --- I think I misunderstood your question about the gradient magnitude; let me try again:
Let $g = \nabla f(x)$. When computing the gradient descent step, the optimal direction is $-g$, i.e. the negative gradient. The actual step that we take (call it $v$) points in that direction, with some chosen step size: $$ v = -\varepsilon g. $$ Hence the improvement that comes from the linear approximation is $$ (f(x) + g^T v) - f(x) = g^T v = - \varepsilon g^T g, $$ so I think that "improvement afforded by the magnitude of the gradient" is not quite accurate: the magnitude of the gradient isn't really something you choose; it's your choice of learning rate $\varepsilon$ that controls the amount of improvement.
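A tiny check of this bookkeeping (the gradient vector and $\varepsilon$ are arbitrary illustrative values):

```python
import numpy as np

g = np.array([3.0, 4.0])   # some gradient at the current point
eps = 0.01                 # learning rate

v = -eps * g                      # the gradient descent step
linear_improvement = g @ v        # change predicted by the linear model
print(linear_improvement)         # negative: the model predicts a decrease

# It equals -eps * g^T g, and g^T g is the squared Euclidean norm:
assert np.isclose(linear_improvement, -eps * (g @ g))
assert np.isclose(g @ g, np.linalg.norm(g) ** 2)
```

So $g^Tg$ is indeed the *squared* magnitude $\|g\|_2^2$, and the predicted improvement scales with it, but the knob you actually turn is $\varepsilon$.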