How does one prove that gradient descent indeed decreases the function in question locally? In other words, if we take a step in the direction of the negative gradient, as in:
$$ x^{(t+1)} = x^{(t)} - \eta \nabla f( x^{(t)} )$$
how does one show rigorously that (for a correctly chosen step size, or sufficiently small step size) we have:
$$ f(x^{(t+1)}) \leq f( x^{(t)} )$$
I am not sure what assumptions one needs for such a statement to hold, but it seems obvious to me that it should be true for (locally) convex functions.
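As a quick sanity check, here is a tiny numerical experiment that is consistent with the claim (the test function $f(x) = x^4$, the starting point, and the step size are all my own arbitrary choices):

```python
# Toy check: do small gradient steps decrease f? (f, x, eta are arbitrary choices)
f = lambda x: x**4
grad = lambda x: 4 * x**3  # f'(x)

x, eta = 1.5, 0.01
for t in range(5):
    x_new = x - eta * grad(x)
    print(f"f(x_t) = {f(x):.6f} -> f(x_t+1) = {f(x_new):.6f}, decreased: {f(x_new) <= f(x)}")
    x = x_new
```

Every step decreases $f$, but of course this is evidence, not a proof.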
I don't know how to attempt this rigorously, but this is what I have tried (it's a hand-wavy but roughly correct attempt, using the mathematics that I remember):
Recall the Taylor expansion of a (single-variable) function:
$$ f(x + \epsilon) = f(x) + \epsilon f'(x) + \frac{ \epsilon^2 }{2} f''(x)+ \frac{ \epsilon^3 }{3!} f'''(x) + \cdots $$
$$ f(x + \epsilon) = \left[ f(x) + \epsilon f'(x) \right] + \left[ \frac{ \epsilon^2 }{2} f''(x)+ \frac{ \epsilon^3 }{3!} f'''(x) + \cdots \right] = \text{Linear}(x,\epsilon) + \text{Err}(x,\epsilon)$$
where $ \text{Linear}(x,\epsilon) = f(x) + \epsilon f'(x) $ is the linear approximation of $f$ for small $\epsilon$, and $\text{Err}(x,\epsilon) = \frac{ \epsilon^2 }{2} f''(x) + \frac{ \epsilon^3 }{3!} f'''(x) + \cdots$ is the error term, which satisfies $ \text{Err}(x,\epsilon) = O( \epsilon^2 f''(x) )$ for sufficiently small $\epsilon$ (the $\epsilon^2$ term is then the leading term).
Now suppose we set $\epsilon = - \eta f'(x)$. In that case we have:
$$ f(x - \eta f'(x)) = f(x) - \eta f'(x)^2 + \frac{ \eta^2 {f'(x)}^2 }{2} f''(x) + \cdots $$
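To make sure I expanded this correctly, I checked it symbolically with sympy for a concrete test function (the choice $f(x) = x^4$ is mine; any smooth function should do):

```python
import sympy as sp

x, eta = sp.symbols('x eta', real=True)
f = x**4                      # concrete test function (arbitrary choice)
fp = sp.diff(f, x)            # f'(x)
fpp = sp.diff(f, x, 2)        # f''(x)

# f evaluated at the gradient step x - eta * f'(x)
lhs = f.subs(x, x - eta * fp)

# the first three terms of the expansion above
approx = f - eta * fp**2 + (eta**2 * fp**2 / 2) * fpp

# the difference should contain only terms of order eta^3 and higher
print(sp.expand(lhs - approx))
```

The difference indeed contains only $\eta^3$ and higher-order terms, matching the expansion above.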
Intuitively, what we want is to choose $\eta$ such that the dominant negative term (i.e. the $-\eta f'(x)^2$ term) is large enough in magnitude to outweigh $\text{Err}(x,\epsilon)$ and therefore actually decrease the function. I think that if $\eta f'(x)^2$ is larger than the leading term of the error, then $f$ should decrease (since the linear part, which is the largest contribution, decreases). So we need:
$$ - \eta f'(x)^2 + \frac{ \eta^2 {f'(x)}^2 }{2} f''(x) \leq 0 \iff \eta f'(x)^2 \geq \frac{ \eta^2 {f'(x)}^2 }{2} f''(x)$$
If we restrict ourselves to $\eta > 0$ (to guarantee a step in the direction that actually decreases the function, i.e. following the negative gradient), then dividing both sides by $\eta f'(x)^2 > 0$ (assuming $f'(x) \neq 0$) gives:
$$1 \geq \frac{ \eta }{2} f''(x) \implies \frac{2}{f''(x)} \geq \eta > 0$$
(the last rearrangement assumes $f''(x) > 0$; if $f''(x) \leq 0$, the second-order term is already non-positive and any $\eta > 0$ works at this order).
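Numerically this bound seems to be exactly right, at least on a quadratic, where $f''$ is constant (the test function and starting point below are my own choices):

```python
# On f(x) = a*x**2/2 we have f''(x) = a, so the bound above reads eta <= 2/a.
a = 3.0
f = lambda x: 0.5 * a * x**2
grad = lambda x: a * x

x0 = 5.0
for eta in (0.5 / a, 2.0 / a, 2.5 / a):   # below, at, and above the bound
    x1 = x0 - eta * grad(x0)
    print(f"eta = {eta:.3f}: f(x0) = {f(x0):.2f}, f(x1) = {f(x1):.2f}, "
          f"non-increasing: {f(x1) <= f(x0)}")
```

Below the bound $f$ strictly decreases, exactly at $\eta = 2/f''$ it stays the same, and above the bound it increases.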
I think the above step size makes sure the function decreases: $f(x^{(t+1)}) \leq f(x^{(t)})$.
Now consider ignoring the error terms that are too small (powers of $\epsilon$ greater than 2); writing $x$ for $x^{(t)}$:
$$ f(x^{(t+1)}) = f(x - \eta f'(x)) \approx f(x) - \eta f'(x)^2 + \frac{ \eta^2 {f'(x)}^2 }{2} f''(x) $$
where we chose $\eta$ such that $ - \eta f'(x)^2 + \frac{ \eta^2 {f'(x)}^2 }{2} f''(x) \leq 0$ holds. Therefore we must have:
$$ f(x^{(t+1)}) \approx f(x) - \eta f'(x)^2 + \frac{ \eta^2 {f'(x)}^2 }{2} f''(x) = f(x) + ( \text{term } \leq 0 ) $$
So, since the term $- \eta f'(x)^2 + \frac{ \eta^2 {f'(x)}^2 }{2} f''(x) \leq 0$, the function $f$ must be decreasing (or staying the same) at each gradient descent step, i.e. $f(x^{(t+1)}) \leq f(x^{(t)})$.
I know the argument I presented is not a real proof, but I was wondering if someone who knows more rigorous analysis or optimization could help me polish off the rough edges of this nearly correct proof.
[Note that the argument is actually rigorous in the case where $f(x)$ has first and second derivatives and all higher-order derivatives are zero; a quadratic function is an example.]
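For instance, with $f(x) = \frac{a}{2} x^2$ and $a > 0$ (my own choice), one gradient step gives exactly
$$ f(x - \eta a x) = \frac{a}{2} (1 - \eta a)^2 x^2 = (1 - \eta a)^2 f(x) \leq f(x) \quad \text{for } 0 \leq \eta \leq \frac{2}{a} = \frac{2}{f''(x)}, $$
since then $|1 - \eta a| \leq 1$.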
[Feel free to generalize my attempt to multivariable functions]
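In case it helps: my guess (which I cannot yet prove) is that the multivariable version would replace $f''(x)$ by an assumed uniform bound on the Hessian, say $\nabla^2 f(x) \preceq L I$ for some $L > 0$, and conclude something like
$$ f(x - \eta \nabla f(x)) \leq f(x) - \eta \left( 1 - \frac{L \eta}{2} \right) \| \nabla f(x) \|^2 \leq f(x) \quad \text{for } 0 < \eta \leq \frac{2}{L}. $$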
An extensive proof can be found here, although some tiny details are missing; hence I would also like to put up this link, which is the complete proof that I have put up on my GitHub. Since the answer would be far too long, and it would amount to unnecessary duplication, I decided to attach the links instead.