Decrease in the size of the gradient in gradient descent

Gradient descent reduces the value of the objective function in each iteration, and this is repeated until convergence.

The question is whether the norm of the gradient also has to decrease in every iteration of gradient descent.
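
For concreteness, here is a minimal sketch of the iteration being asked about (hypothetical Python; the names `gradient_descent`, `f`, and `grad_f`, the fixed step size, and the stopping tolerance are illustrative choices, not part of the question):

```python
import numpy as np

def gradient_descent(f, grad_f, x0, step=0.1, tol=1e-8, max_iter=1000):
    """Plain gradient descent with a fixed step size (illustrative sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # declare convergence when the gradient is tiny
            break
        x = x - step * g              # for a small enough step, f(x) decreases
    return x, f(x)
```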


Edit: How about when the objective is a convex function?


BEST ANSWER

No. Take for example $f(x)=\sqrt{|x|}$. While $f(x)$ decreases monotonically at each step under the assumption that the steps are small enough (or if you apply a backtracking line search to enforce monotonicity), the gradient $f'(x) = \frac{\mathrm{sign}(x)}{2\sqrt{|x|}}$ grows in modulus as the current iterate $x$ approaches the minimizer $x=0$.

If you don't like the kink at $0$ in this example, another simple one where the gradient norm does not decrease monotonically is $1-\cos(x)$ for $x\in (-\pi, \pi)$ (which is $C^\infty$).
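
A quick numerical check of the smooth example (a small Python/NumPy sketch; the starting point $x_0 = 3$ and the step size $0.5$ are arbitrary choices):

```python
import numpy as np

f = lambda x: 1.0 - np.cos(x)       # smooth example from the answer
grad = lambda x: np.sin(x)

x, step = 3.0, 0.5                  # start near pi with a modest fixed step
for k in range(8):
    g = grad(x)
    print(f"k={k}  f={f(x):.4f}  |grad|={abs(g):.4f}")
    x -= step * g
```

In this run, $f(x_k)$ decreases at every step, while $|f'(x_k)| = |\sin(x_k)|$ keeps growing for the first several iterations before it starts to shrink.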

ANSWER

Suppose we have the following objective function

$$f (x) := \frac{1}{2} x^T Q x - r^T x$$

where $Q \in \mathbb R^{n \times n}$ is symmetric and positive definite and $r \in \mathbb R^n$. From the symmetry of $Q$, we conclude that its eigenvalues are real. From the positive definiteness of $Q$, we conclude that its eigenvalues are positive and that $f$ is a convex function.

Taking the gradient of $f$,

$$\nabla f (x) = Q x - r$$

The gradient vanishes at $\bar{x} := Q^{-1} r$, which is the unique solution to the linear system $Q x = r$.

Doing gradient descent,

$$\begin{array}{rl} x_{k+1} &= x_k - \alpha \nabla f (x_k)\\\\ &= x_k - \alpha Q x_k + \alpha r\\\\ &= (I_n - \alpha Q) x_k + \alpha r\end{array}$$

Subtracting $\bar{x}$ from both sides and using $Q \bar{x} = r$ gives $x_{k+1} - \bar{x} = (I_n - \alpha Q)(x_k - \bar{x})$. Hence, by induction,

$$x_k = \bar{x} + (I_n - \alpha Q)^k (x_0 - \bar{x})$$
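
A small numerical sanity check of this closed form (a Python/NumPy sketch; the dimension, the randomly generated $Q$, $r$, $x_0$, the step size, and the iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)                 # symmetric positive definite
r = rng.standard_normal(n)
x_bar = np.linalg.solve(Q, r)               # minimizer Q^{-1} r

alpha = 1.0 / np.linalg.eigvalsh(Q).max()   # a step inside (0, 2/lambda_max)
x0 = rng.standard_normal(n)

# iterate the recursion x_{k+1} = (I - alpha Q) x_k + alpha r
x = x0.copy()
k = 20
for _ in range(k):
    x = x - alpha * (Q @ x - r)

# compare with the closed form x_k = x_bar + (I - alpha Q)^k (x_0 - x_bar)
closed = x_bar + np.linalg.matrix_power(np.eye(n) - alpha * Q, k) @ (x0 - x_bar)
print(np.allclose(x, closed))               # True
```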

Convergence to $\bar{x}$ happens when

$$|1 - \alpha \lambda_i (Q)| < 1$$

for all $i \in \{1,2,\dots,n\}$, i.e., when

$$0 < \alpha < \frac{2}{\lambda_{\max} (Q)} =: \alpha_{\max}$$

Evaluating the gradient,

$$\begin{array}{rl} \nabla f (x_{k+1}) &= Q x_{k+1} - r\\\\ &= Q (I_n - \alpha Q) x_k + \alpha Q r - r\\\\ &= (I_n - \alpha Q) (Q x_k - r)\\\\ &= (I_n - \alpha Q) \nabla f (x_k)\end{array}$$

Thus,

$$\|\nabla f (x_{k+1})\|_2 \leq \|I_n - \alpha Q\|_2 \|\nabla f (x_k)\|_2$$

If $0 < \alpha < \alpha_{\max}$, then $|1 - \alpha \lambda_i (Q)| < 1$ for all $i$ and, hence, $\|I_n - \alpha Q\|_2 < 1$. Thus, whenever the step size is small enough for convergence, the $2$-norm of the gradient contracts in each iteration:

$$\|\nabla f (x_{k+1})\|_2 < \|\nabla f (x_k)\|_2$$
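
As a numerical illustration (a small Python/NumPy sketch; the random $Q$ and $r$, the dimension, the factors $0.9$ and $1.1$, and the iteration count are arbitrary choices): with $\alpha$ just below $\alpha_{\max}$ the gradient norms shrink at every step, while with $\alpha$ just above $\alpha_{\max}$ they eventually blow up.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
Q = A @ A.T + np.eye(n)                     # symmetric positive definite
r = rng.standard_normal(n)
alpha_max = 2.0 / np.linalg.eigvalsh(Q).max()

def grad_norms(alpha, iters=25):
    """Return ||grad f(x_k)||_2 along the gradient-descent trajectory."""
    x = rng.standard_normal(n)
    norms = []
    for _ in range(iters):
        g = Q @ x - r
        norms.append(np.linalg.norm(g))
        x = x - alpha * g
    return norms

print(grad_norms(0.9 * alpha_max))   # strictly decreasing
print(grad_norms(1.1 * alpha_max))   # eventually grows without bound (divergence)
```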

ANSWER

@Rodrigo de Acevedo's answer only covers the (convex) quadratic case, but it generalizes:

If $f$ is convex and $C^2$, then yes, the gradient norm decreases if the step size is strictly less than $2/L$, where $L$ is the Lipschitz constant of $\nabla f$ (and since we are in the twice-differentiable case, we have $\nabla^2 f(x) \preceq L \, \mathrm{Id}$ for all $x$).

By Taylor's formula applied to $\nabla f$, there exists $z \in [x_k, x_{k+1}]$ such that, with $H = \nabla^2 f(z)$, \begin{equation} \nabla f(x_{k+1}) - \nabla f(x_k) = H(x_{k+1} - x_k). \end{equation} Write $g_k = \nabla f(x_k)$ and $s = x_{k+1} - x_k = -\eta g_k$, so that $g_k = -\frac{1}{\eta} s$. Then: \begin{align} \Vert g_{k+1} \Vert^2 - \Vert g_k \Vert^2 &= \langle Hs, Hs + 2 g_k \rangle \\ &= \langle Hs, Hs - \tfrac{2}{\eta} s \rangle \\ &= s^\top H^2 s - \tfrac{2}{\eta} s^\top H s. \end{align} Introduce $u = H^{1/2}s$ (we need convexity here, because if $H$ is not PSD you cannot take its square root!); you end up with $u^\top H u - \frac{2}{\eta} \Vert u \Vert^2 \leq \left(L - \frac{2}{\eta}\right) \Vert u \Vert^2$, which is nonpositive for $0 < \eta < 2 / L$ (and strictly negative unless $u = 0$).
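
For a non-quadratic sanity check (a Python/NumPy sketch with a hypothetical test function, not taken from the answer): $f(x) = \sum_i \log\cosh(x_i)$ is convex and $C^2$, its gradient is $\tanh$ applied componentwise, and $\nabla^2 f(x) \preceq \mathrm{Id}$, so $L = 1$; any step size $\eta$ strictly below $2$ should then give a non-increasing gradient norm.

```python
import numpy as np

# f(x) = sum_i log(cosh(x_i)) is convex and C^2; grad f(x) = tanh(x),
# and the Hessian diag(1 - tanh(x)^2) satisfies H <= Id, so L = 1.
grad = np.tanh

rng = np.random.default_rng(2)
x = 5.0 * rng.standard_normal(4)
eta = 1.9                              # strictly below 2/L = 2

norms = []
for _ in range(15):
    g = grad(x)
    norms.append(np.linalg.norm(g))
    x = x - eta * g

print(np.all(np.diff(norms) <= 0))     # True: the gradient norm never increases
```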