Decrease in the size of gradient in gradient descent


Gradient descent (with a suitably small step size) reduces the value of the objective function at each iteration, and this is repeated until convergence.

The question is whether the norm of the gradient must also decrease at every iteration of gradient descent.


Edit: How about when the objective is a convex function?


BEST ANSWER

No. Take for example $f(x)=\sqrt{|x|}$. While $f(x)$ decreases monotonically at each step, assuming the steps are small enough (or if you apply a backtracking line search to enforce monotonicity), the gradient $f'(x) = \frac{\operatorname{sign}(x)}{2\sqrt{|x|}}$ grows in modulus as the current iterate $x$ approaches the solution $x=0$.

If you don't like the kink at $0$ in this example, another simple one where the gradient norm does not decrease monotonically is $1-\cos(x)$ for $x\in (-\pi, \pi)$, which is $C^\infty$.
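A quick numerical check of the smooth example. This sketch uses an illustrative step size $\alpha = 0.5$ and starting point $x_0 = 3$ (close to $\pi$, where the gradient is small); the objective drops at every step while the gradient norm initially grows.

```python
import math

# f(x) = 1 - cos(x) on (-pi, pi); its derivative is sin(x).
f = lambda x: 1.0 - math.cos(x)
grad = lambda x: math.sin(x)

alpha = 0.5   # illustrative step size; f has 1-Lipschitz gradient, so any alpha < 2 descends
x = 3.0       # start near pi, where |sin(x)| is small

values, grad_norms = [f(x)], [abs(grad(x))]
for _ in range(10):
    x -= alpha * grad(x)
    values.append(f(x))
    grad_norms.append(abs(grad(x)))

# The objective decreases every iteration...
assert all(a > b for a, b in zip(values, values[1:]))
# ...yet the gradient norm grows at first, as x moves toward pi/2.
assert grad_norms[1] > grad_norms[0]
```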


Suppose we have the following objective function

$$f (x) := \frac{1}{2} x^T Q x - r^T x$$

where $Q \in \mathbb R^{n \times n}$ is symmetric and positive definite and $r \in \mathbb R^n$. From the symmetry of $Q$, we conclude that its eigenvalues are real. From the positive definiteness of $Q$, we conclude that its eigenvalues are positive and that $f$ is a convex function.

Taking the gradient of $f$,

$$\nabla f (x) = Q x - r$$

The gradient vanishes at $\bar{x} := Q^{-1} r$, which is the unique solution to the linear system $Q x = r$.

Doing gradient descent,

$$\begin{array}{rl} x_{k+1} &= x_k - \alpha \nabla f (x_k)\\\\ &= x_k - \alpha Q x_k + \alpha r\\\\ &= (I_n - \alpha Q) x_k + \alpha r\end{array}$$

Since $\bar{x}$ is a fixed point of this iteration, subtracting it from both sides gives $x_{k+1} - \bar{x} = (I_n - \alpha Q)(x_k - \bar{x})$. Hence,

$$x_k = \bar{x} + (I_n - \alpha Q)^k (x_0 - \bar{x})$$
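This closed form is easy to verify numerically. The sketch below uses a small, arbitrarily chosen symmetric positive definite $Q$, and compares $k$ steps of the iteration against $x_k = \bar{x} + (I_n - \alpha Q)^k (x_0 - \bar{x})$.

```python
import numpy as np

# Illustrative symmetric positive definite Q and right-hand side r.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
r = np.array([1.0, -1.0])
x_bar = np.linalg.solve(Q, r)        # unique minimizer, solving Q x = r

alpha = 0.2                          # well below 2 / lambda_max(Q) for this Q
x0 = np.array([5.0, -4.0])

# Run k iterations of x_{k+1} = (I - alpha Q) x_k + alpha r ...
k = 7
x = x0.copy()
for _ in range(k):
    x = x - alpha * (Q @ x - r)

# ... and compare with the closed form x_k = x_bar + (I - alpha Q)^k (x0 - x_bar).
M = np.eye(2) - alpha * Q
closed_form = x_bar + np.linalg.matrix_power(M, k) @ (x0 - x_bar)
assert np.allclose(x, closed_form)
```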

Since $(I_n - \alpha Q)^k \to 0$ exactly when every eigenvalue of $I_n - \alpha Q$ has modulus less than $1$, convergence to $\bar{x}$ (for every $x_0$) happens when

$$|1 - \alpha \lambda_i (Q)| < 1$$

for all $i \in \{1,2,\dots,n\}$, i.e., when

$$0 < \alpha < \frac{2}{\lambda_{\max} (Q)} =: \alpha_{\max}$$

Evaluating the gradient,

$$\begin{array}{rl} \nabla f (x_{k+1}) &= Q x_{k+1} - r\\\\ &= Q (I_n - \alpha Q) x_k + \alpha Q r - r\\\\ &= (I_n - \alpha Q) (Q x_k - r)\\\\ &= (I_n - \alpha Q) \nabla f (x_k)\end{array}$$

Thus,

$$\|\nabla f (x_{k+1})\|_2 \leq \|I_n - \alpha Q\|_2 \|\nabla f (x_k)\|_2$$

If $0 < \alpha < \alpha_{\max}$, then $|1 - \alpha \lambda_i (Q)| < 1$ and $\|I_n - \alpha Q\|_2 < 1$. Thus, if we have convergence, then the $2$-norm of the gradient contracts with each iteration:

$$\|\nabla f (x_{k+1})\|_2 < \|\nabla f (x_k)\|_2$$
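A numerical sketch of this contraction, again with an illustrative positive definite $Q$: for any step size in $(0, \alpha_{\max})$, the gradient 2-norm shrinks strictly at every iteration.

```python
import numpy as np

Q = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite (illustrative)
r = np.array([1.0, -1.0])

lam_max = np.linalg.eigvalsh(Q).max()
alpha = 0.9 * (2.0 / lam_max)            # any alpha in (0, alpha_max) would do

x = np.array([5.0, -4.0])
g = Q @ x - r
norms = [np.linalg.norm(g)]
for _ in range(20):
    x = x - alpha * g
    g = Q @ x - r
    norms.append(np.linalg.norm(g))

# Strict contraction of the gradient 2-norm at every iteration.
assert all(b < a for a, b in zip(norms, norms[1:]))
```

For the non-convex example $1-\cos(x)$ above, no such contraction holds, which is what separates the two answers.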