I'm trying to understand a proof that taking a small enough step along the descent direction decreases the objective function in gradient descent methods.
Given a vector $x \in \mathbb{R}^n$ with $\nabla f(x) \neq 0$ and the half-line of vectors:
$\{x_\alpha = x - \alpha \nabla f(X): \alpha > 0 \} $
It says that a first-order Taylor approximation around x yields:
$\begin{align}f(x_{\alpha}) &= f(x) + \nabla f(x)^T(x_\alpha - x) + o(||x_\alpha - x||) \\ &= f(x) - \alpha ||\nabla f(x)||^2 + o(\alpha ||\nabla f(x)||) \\ &= f(x) - \alpha ||\nabla f(x)||^2 + o(\alpha) \end{align} $
However it is quite unclear to me what the term $o(||x_\alpha - x||)$ exactly means, where its from and why it can be simplified to $o(\alpha)$ in the last step.
Recalling the multivariate Taylor's theorem, we find that the error in a Taylor expansion (up to degree $1$) is $$ \sum_{i=1}^n h_i(x_\alpha)((x_\alpha)_i-\alpha_i) $$ where $h_i(x_\alpha)$ approaches zero as $x_\alpha$ approaches $x$. In this case, we get that the error is bounded from above by $$ \|x_\alpha-x\|\sum_{i=1}^nh_i(x_\alpha). $$
Now, consider the relationship between this and $C\|x_\alpha-x\|$ for any constant $C$. Since the $h_i$'s approach zero, for any $C$ you choose, with $\alpha$ sufficiently small, the entire error is dominated by $C\|x_\alpha-x\|$. This is what the little-oh notation means here: In terms of $\alpha$, when $\alpha$ gets small, the error is going to zero faster than $\|x_\alpha-x\|$ does. You can think of this as saying that the error is significantly smaller than $\|x_\alpha-x\|$.
The little-oh notation is about growth rates, so it doesn't depend on constants. In this problem, $\|\nabla f(x)\|$ is really just a constant since $x$ is fixed. Therefore, this factor can be absorbed into the little-oh.
Therefore, the last line is saying that $f(x_\alpha)$ is equal to $f(x)-\alpha\|\nabla f(x)\|^2$ plus an error term. The error term is $o(\alpha)$, meaning that it goes to zero faster than $\alpha$ does. Therefore, for $\alpha$ sufficiently small, ehe error term is significantly smaller than $\alpha\|\nabla f(x)\|^2$.