Decrease in the size of the gradient in gradient descent

Gradient descent reduces the value of the objective function in each iteration, and this is repeated until convergence.

The question is whether the norm of the gradient also has to decrease in every iteration of gradient descent.
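
For concreteness, here is a minimal sketch of the iteration being asked about (hypothetical Python; the names `gradient_descent`, `f`, and `grad_f`, the fixed step size, and the stopping tolerance are illustrative choices, not part of the question):

```python
import numpy as np

def gradient_descent(f, grad_f, x0, step=0.1, tol=1e-8, max_iter=1000):
    """Plain gradient descent with a fixed step size (illustrative sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # declare convergence when the gradient is tiny
            break
        x = x - step * g              # for a small enough step, f(x) decreases
    return x, f(x)
```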


Edit: How about when the objective is a convex function?


BEST ANSWER

No. Take for example $f(x)=\sqrt{|x|}$. While $f(x)$ decreases monotonically at each step under the assumption that the steps are small enough (or if you apply a backtracking line search to enforce monotonicity), the gradient $f'(x) = \frac{\mathrm{sign}(x)}{2\sqrt{|x|}}$ grows in modulus as the current iterate $x$ approaches the minimizer $x=0$.

If you don't like the kink at $0$ in this example, another simple one where the gradient norm does not decrease monotonically is $1-\cos(x)$ for $x\in (-\pi, \pi)$ (which is $C^\infty$).
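
A quick numerical check of the smooth example (a small Python/NumPy sketch; the starting point $x_0 = 3$ and the step size $0.5$ are arbitrary choices):

```python
import numpy as np

f = lambda x: 1.0 - np.cos(x)       # smooth example from the answer
grad = lambda x: np.sin(x)

x, step = 3.0, 0.5                  # start near pi with a modest fixed step
for k in range(8):
    g = grad(x)
    print(f"k={k}  f={f(x):.4f}  |grad|={abs(g):.4f}")
    x -= step * g
```

In this run, $f(x_k)$ decreases at every step, while $|f'(x_k)| = |\sin(x_k)|$ keeps growing for the first several iterations before it starts to shrink.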

ANSWER

Suppose we have the following objective function

$$f (x) := \frac{1}{2} x^T Q x - r^T x$$

where $Q \in \mathbb R^{n \times n}$ is symmetric and positive definite and $r \in \mathbb R^n$. From the symmetry of $Q$, we conclude that its eigenvalues are real. From the positive definiteness of $Q$, we conclude that its eigenvalues are positive and that $f$ is a convex function.

Taking the gradient of $f$,

$$\nabla f (x) = Q x - r$$

The gradient vanishes at $\bar{x} := Q^{-1} r$, which is the unique solution to the linear system $Q x = r$.

Doing gradient descent,

$$\begin{array}{rl} x_{k+1} &= x_k - \alpha \nabla f (x_k)\\\\ &= x_k - \alpha Q x_k + \alpha r\\\\ &= (I_n - \alpha Q) x_k + \alpha r\end{array}$$

Subtracting $\bar{x}$ from both sides and using $Q \bar{x} = r$ gives $x_{k+1} - \bar{x} = (I_n - \alpha Q)(x_k - \bar{x})$. Hence, by induction,

$$x_k = \bar{x} + (I_n - \alpha Q)^k (x_0 - \bar{x})$$
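
A small numerical sanity check of this closed form (a Python/NumPy sketch; the dimension, the randomly generated $Q$, $r$, $x_0$, the step size, and the iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)                 # symmetric positive definite
r = rng.standard_normal(n)
x_bar = np.linalg.solve(Q, r)               # minimizer Q^{-1} r

alpha = 1.0 / np.linalg.eigvalsh(Q).max()   # a step inside (0, 2/lambda_max)
x0 = rng.standard_normal(n)

# iterate the recursion x_{k+1} = (I - alpha Q) x_k + alpha r
x = x0.copy()
k = 20
for _ in range(k):
    x = x - alpha * (Q @ x - r)

# compare with the closed form x_k = x_bar + (I - alpha Q)^k (x_0 - x_bar)
closed = x_bar + np.linalg.matrix_power(np.eye(n) - alpha * Q, k) @ (x0 - x_bar)
print(np.allclose(x, closed))               # True
```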

Convergence to $\bar{x}$ happens when

$$|1 - \alpha \lambda_i (Q)| < 1$$

for all $i \in \{1,2,\dots,n\}$, i.e., when

$$0 < \alpha < \frac{2}{\lambda_{\max} (Q)} =: \alpha_{\max}$$

Evaluating the gradient,

$$\begin{array}{rl} \nabla f (x_{k+1}) &= Q x_{k+1} - r\\\\ &= Q (I_n - \alpha Q) x_k + \alpha Q r - r\\\\ &= (I_n - \alpha Q) (Q x_k - r)\\\\ &= (I_n - \alpha Q) \nabla f (x_k)\end{array}$$

Thus,

$$\|\nabla f (x_{k+1})\|_2 \leq \|I_n - \alpha Q\|_2 \|\nabla f (x_k)\|_2$$

If $0 < \alpha < \alpha_{\max}$, then $|1 - \alpha \lambda_i (Q)| < 1$ for all $i$ and, hence, $\|I_n - \alpha Q\|_2 < 1$. Thus, whenever the step size is small enough for convergence, the $2$-norm of the gradient contracts in each iteration:

$$\|\nabla f (x_{k+1})\|_2 < \|\nabla f (x_k)\|_2$$
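
As a numerical illustration (a small Python/NumPy sketch; the random $Q$ and $r$, the dimension, the factors $0.9$ and $1.1$, and the iteration count are arbitrary choices): with $\alpha$ just below $\alpha_{\max}$ the gradient norms shrink at every step, while with $\alpha$ just above $\alpha_{\max}$ they eventually blow up.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
Q = A @ A.T + np.eye(n)                     # symmetric positive definite
r = rng.standard_normal(n)
alpha_max = 2.0 / np.linalg.eigvalsh(Q).max()

def grad_norms(alpha, iters=25):
    """Return ||grad f(x_k)||_2 along the gradient-descent trajectory."""
    x = rng.standard_normal(n)
    norms = []
    for _ in range(iters):
        g = Q @ x - r
        norms.append(np.linalg.norm(g))
        x = x - alpha * g
    return norms

print(grad_norms(0.9 * alpha_max))   # strictly decreasing
print(grad_norms(1.1 * alpha_max))   # eventually grows without bound (divergence)
```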

ANSWER

@Rodrigo de Acevedo's answer only covers the (convex) quadratic case, but it generalizes:

If $f$ is convex and $C^2$, then yes, the gradient norm decreases if the step size is strictly less than $2/L$, where $L$ is the Lipschitz constant of $\nabla f$ (and since we are in the twice-differentiable case, we have $\nabla^2 f(x) \preceq L \, \mathrm{Id}$ for all $x$).

By Taylor's formula applied to $\nabla f$, there exists $z \in [x_k, x_{k+1}]$ such that, with $H = \nabla^2 f(z)$, \begin{equation} \nabla f(x_{k+1}) - \nabla f(x_k) = H(x_{k+1} - x_k). \end{equation} Write $g_k = \nabla f(x_k)$ and $s = x_{k+1} - x_k = -\eta g_k$, so that $g_k = -\frac{1}{\eta} s$. Then: \begin{align} \Vert g_{k+1} \Vert^2 - \Vert g_k \Vert^2 &= \langle Hs, Hs + 2 g_k \rangle \\ &= \langle Hs, Hs - \tfrac{2}{\eta} s \rangle \\ &= s^\top H^2 s - \tfrac{2}{\eta} s^\top H s. \end{align} Introduce $u = H^{1/2}s$ (we need convexity here, because if $H$ is not PSD you cannot take its square root!); you end up with $u^\top H u - \frac{2}{\eta} \Vert u \Vert^2 \leq \left(L - \frac{2}{\eta}\right) \Vert u \Vert^2$, which is nonpositive for $0 < \eta < 2 / L$ (and strictly negative unless $u = 0$).
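
For a non-quadratic sanity check (a Python/NumPy sketch with a hypothetical test function, not taken from the answer): $f(x) = \sum_i \log\cosh(x_i)$ is convex and $C^2$, its gradient is $\tanh$ applied componentwise, and $\nabla^2 f(x) \preceq \mathrm{Id}$, so $L = 1$; any step size $\eta$ strictly below $2$ should then give a non-increasing gradient norm.

```python
import numpy as np

# f(x) = sum_i log(cosh(x_i)) is convex and C^2; grad f(x) = tanh(x),
# and the Hessian diag(1 - tanh(x)^2) satisfies H <= Id, so L = 1.
grad = np.tanh

rng = np.random.default_rng(2)
x = 5.0 * rng.standard_normal(4)
eta = 1.9                              # strictly below 2/L = 2

norms = []
for _ in range(15):
    g = grad(x)
    norms.append(np.linalg.norm(g))
    x = x - eta * g

print(np.all(np.diff(norms) <= 0))     # True: the gradient norm never increases
```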