Convergence of gradient descent without global Lipschitz gradient assumption

619 Views Asked by At

My question is very similar to this one and this one, but they haven't been answered.

Let $f \in C^2(\mathbb{R}^d, \mathbb{R})$ have compact sublevel sets and isolated critical points, and consider the gradient descent update $$ x_{k+1} = x_k-\alpha\nabla f(x_k) $$ for some fixed initial point $x_0$ and learning rate $\alpha$. If $f$ has $L$-Lipschitz gradient globally, it is known that $x_k$ converges to a critical point of $f$ for any $0 < \alpha < 2/L$. Now assume we drop the Lipschitz assumption. The set $U_0 = \{ f(x) \leq f(x_0) \}$ is compact and $\nabla f \in C^1$, so we can define $L = \sup_{x \in U} \lVert \nabla^2 f(x) \rVert < \infty$ (in $L^2$ norm).

I would like to prove (or disprove) that $x_k \in U_0$ for all $k$ for all $0 < \alpha < 2/L$. This would imply that $x_k$ converges to a critical point since $f|_U$ is $L$-Lipschitz. The idea would be to prove $f(x_{k+1}) \leq f(x_k)$ and conclude by induction, by Taylor expanding \begin{align*} f(x_{k+1}) &= f(x_k-\alpha \nabla f(x_k)) \\ &= f(x_k) - \alpha \lVert \nabla f(x_k) \rVert^2 + \frac{\alpha^2}{2}\nabla f(x_k)^T\nabla^2 f(x_k-t\alpha\nabla f(x_k))f(x_k) \end{align*} for some $t \in (0, 1)$. Now if we assume $(x_k-t\alpha\nabla f(x_k)) \in U$, we can conclude $$f(x_{k+1}) \leq f(x_k) - \alpha \lVert \nabla f(x_k) \rVert^2\left(1-\frac{\alpha L}{2}\right) \leq f(x_k)$$ for $\alpha < 2/L$, but this is (almost) a circular assumption... Any ideas?

1

There are 1 best solutions below

3
On BEST ANSWER

This holds: here is a proof.$\newcommand{\T}{x}\newcommand{\al}{\alpha}\newcommand{\bal}{\bar{\alpha}}$

Define $U_\al = \{ \T-t\al\nabla f(\T) \mid t \in [0,1], \T\in U_0 \}$ and the continuous function $L(\al) = \sup_{\T \in U_\al} \lVert{\nabla^2 f(\T)}\rVert$. Notice that $U_0 \subset U_{\al}$ for all $\al < \al'$. We prove that $\al L(\al) < 2$ implies $U_\al = U_0$ and in particular, $L(\al) = L(0) = L$. By Taylor expansion,

$$f(\T-t\al\nabla f) = f(\T) - \al \lVert{\nabla f(\T)}\rVert^2 + \frac{t^2\al^2}{2}\nabla f(\T)^T\nabla^2 f(\T-t'\al\nabla f)f(\T) $$

for some $t' \in [0,t] \subset [0,1]$. Since $\T-t'\al\nabla f \in U_\al$, it follows that

$$ f(\T-t\al\nabla f) \leq f(\T) -\al\lVert{\nabla f(\T)}\rVert^2(1-\al L(\al)/2) \leq f(\T) $$

for all $\al L(\al) < 2$. In particular, $\T-t\al\nabla f \in U_0$ and hence $U_\al = U_0$. We conclude that $\al L(\al) < 2$ implies $L(\al)=L$, implying in turn $\al L < 2$. We now claim the converse, namely that $\al L < 2$ implies $\al L(\al) < 2$. For contradiction, assume otherwise that there exists $\al' L < 2$ with $\al'L(\al') \geq 2$. Since $\al L(\al)$ is continuous and $0 L(0) = 0 < 2$, there exists $\bal \leq \al'$ such that $\bal L < 2$ and $\bal L(\bal) = 2$. This is in contradiction with continuity:

$$ 2 = \bal L(\bal) = \lim_{\al\to\bal^-} \al L(\al) = \lim_{\al\to\bal^-} \al L = \bal L \,. $$

Finally we conclude that $U_\al = U_0$ for all $\al L < 2$. In particular, $\T_0 \in U_0$ implies $\T_k \in U_0$ by induction.