I'm reading Nesterov's book Introductory Lectures on Convex Optimization. On p. 22 he proves that if $f:\mathbb{R}^n \to \mathbb{R}$ is continuously differentiable and $\nabla f$ is Lipschitz continuous with constant $L$, then
$$|f(y) - f(x) - \langle \nabla f(x), y - x\rangle | \leq \frac{L}{2}\|y - x\|^2$$
This inequality is marked with the tag (1.2.5).
Then on page 26 he considers the gradient method for unconstrained minimization of $f$, whose iteration is
$$x_{k+1} \leftarrow x_k - h \cdot \nabla f(x_k)$$
for some step size $h > 0$. He writes:
A couple of questions:
- How does he arrive at the step $f(y) \leq f(x) + \langle f'(x), y-x \rangle + \frac{L}{2}\|y-x\|^2$? That is, how does he get rid of the absolute value in inequality (1.2.5)?
- He writes "one step of the gradient method decreases the value...at least...$\frac{1}{2L}\|f'(x)\|^2$". I suppose he means that this holds if we choose the value of $h$ that makes the upper bound sharpest, right? But not all of his algorithms do that: he covers various step-size rules next, and of course they don't all choose $h=\frac{1}{L}$. So it isn't really true that the gradient method always decreases $f$ by this much, is it?

For the first question: if $|z| \leq M$, then in particular $z \leq M$. Applying this to (1.2.5) with $z = f(y) - f(x) - \langle \nabla f(x), y-x \rangle$ and $M = \frac{L}{2}\|y-x\|^2$ gives
$$f(y) - f(x) - \langle \nabla f(x), y-x \rangle \leq \frac{L}{2}\|y-x\|^2,$$
and hence
$$f(y) \leq f(x) + \langle \nabla f(x), y-x \rangle + \frac{L}{2}\|y-x\|^2.$$
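For the second question: substituting $y = x_k - h\nabla f(x_k)$ into this bound gives
$$f(x_{k+1}) \leq f(x_k) - h\left(1 - \tfrac{hL}{2}\right)\|\nabla f(x_k)\|^2,$$
and the coefficient $h(1 - \frac{hL}{2})$ is maximized at $h = \frac{1}{L}$, where it equals $\frac{1}{2L}$. So the $\frac{1}{2L}$ figure is the guarantee for the optimal constant step $h = \frac{1}{L}$; other step-size choices only guarantee the weaker factor $h(1 - \frac{hL}{2})$. Here is a quick numerical sanity check on a toy quadratic (my own example, not from the book), where the Lipschitz constant of the gradient is exactly the largest eigenvalue of the Hessian:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy example: f(x) = 1/2 x^T A x with A symmetric positive definite,
# so grad f(x) = A x and grad f is Lipschitz with L = lambda_max(A).
n = 5
B = rng.standard_normal((n, n))
A = B.T @ B + np.eye(n)          # symmetric positive definite
L = np.linalg.eigvalsh(A).max()  # Lipschitz constant of the gradient

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = rng.standard_normal(n)
g = grad(x)
h = 1.0 / L                      # the step that makes the upper bound sharpest
x_next = x - h * g               # one step of the gradient method

decrease = f(x) - f(x_next)
guaranteed = g @ g / (2 * L)     # the 1/(2L) * ||grad f(x)||^2 guarantee

# One step decreases f by at least ||grad f(x)||^2 / (2L)
assert decrease >= guaranteed - 1e-12
```

Running this with other values of $h \in (0, \frac{2}{L})$ still decreases $f$, but the guaranteed decrease $h(1 - \frac{hL}{2})\|\nabla f(x)\|^2$ is smaller.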