I have a fundamental confusion about gradient descent (with line search) and the reason it works. I will explain my view here; please tell me where it goes wrong.
Let $f: \mathbb{R}^n \to \mathbb{R}$ be a convex function.
I think gradient descent (with line search) can be viewed as the following alternating descent procedure:
$1. \,$ Initialize $$ t=0, a_t = 1, x_t=0$$
$2. \,$While not converged
$\quad 2.1\,$ Select the steepest descent direction $$D_{t+1} = -\nabla f(x_t) \propto \lim_{r\to0}\operatorname{argmin}_{D: D\in{\mathbb{R}}^n,\, \|D\|=r} f(x_t + D)$$
$\quad 2.2 \,$ Select the best step size (line search) $$a_{t+1}=\operatorname{argmin}_{a: a\in \mathbb{R},\, a>0} f(x_t + aD_{t+1})$$
$\quad 2.3 \,$ Update the current point $$x_{t+1}= x_t + a_{t+1} D_{t+1}$$
$\quad 2.4 \,$ Proceed to the next step $$t = t+1$$
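The loop above can be sketched numerically. Everything below is my own illustrative choice: I take the convex quadratic $f(x)=\frac12 x^\top A x - b^\top x$ (with $A$ symmetric positive definite), for which the exact line search in step 2.2 has a closed form.

```python
import numpy as np

# Illustrative convex quadratic f(x) = 0.5 x^T A x - b^T x;
# A is symmetric positive definite, so the global minimizer is A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

def grad(x):
    return A @ x - b

x = np.zeros(2)           # step 1: x_0 = 0
for t in range(50):       # step 2: iterate until (approximately) converged
    D = -grad(x)          # step 2.1: steepest descent direction
    if np.linalg.norm(D) < 1e-10:
        break
    # step 2.2: exact line search; for this quadratic, argmin_a f(x + a D)
    # is a = (D^T D) / (D^T A D)
    a = (D @ D) / (D @ A @ D)
    x = x + a * D         # step 2.3: update the current point

x_star = np.linalg.solve(A, b)   # global minimizer, for comparison
print(np.allclose(x, x_star))    # the iterates reach the global minimum
```

For a general convex $f$, step 2.2 would be a one-dimensional minimization rather than a closed-form expression, but the structure of the loop is the same.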
Now I noticed that while $f$ is convex in its input, the objective $(a, D) \mapsto f(x_t + aD)$ is not jointly convex in $a$ and $D$. Then why is such an alternating procedure guaranteed to converge to a global minimum at all?
As I understand it, the confusion is that the problem is convex in $x$, but not (understood to be) jointly convex in the step size $a$ and the direction $D$.
However, when we do a line search, we restrict $f$ to another convex set (the line), and the restriction of a convex function to a line is itself convex. The line search is therefore a one-dimensional convex problem, whose minimizer gives the step size. Solving it must bring us closer to the minimum of the original problem: minimizing along the line decreases $f$ unless the directional derivative at $x_t$ is already zero, and since the direction is $-\nabla f(x_t)$, that happens only when the gradient is zero, i.e. when we are at the minimum already.
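The claim that the restriction $g(a) = f(x + aD)$ of a convex $f$ to a line is itself convex can be sanity-checked numerically. The function, base point, and direction below are my own arbitrary examples; the check is midpoint convexity over a grid of step sizes.

```python
import numpy as np

def f(x):
    # An illustrative smooth convex function: log-sum-exp plus a quadratic.
    return np.log(np.sum(np.exp(x))) + 0.5 * np.sum(x**2)

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # arbitrary base point
D = rng.normal(size=3)        # arbitrary direction

g = lambda a: f(x + a * D)    # restriction of f to the line {x + aD}

# Midpoint convexity: g((a1+a2)/2) <= (g(a1)+g(a2))/2 for all a1, a2.
ok = all(
    g((a1 + a2) / 2) <= (g(a1) + g(a2)) / 2 + 1e-12
    for a1 in np.linspace(-2.0, 2.0, 9)
    for a2 in np.linspace(-2.0, 2.0, 9)
)
print(ok)  # f is convex along the line
```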
So if this process converges, it must converge to the global minimum. I believe convergence follows from the convergence of ordinary gradient descent, simply because at every iteration the exact line search decreases $f$ at least as much as any fixed step size would.