Difference between Gradient Descent method and Steepest Descent


What is the difference between Gradient Descent method and Steepest Descent methods?

In this book, they have come under different sections:

http://stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf

According to page 480, Gradient Descent is:

$$\Delta x=-\nabla f(x)$$

While page 490, for Steepest descent says:

$$\Delta x_{\text{sd}}=\|\nabla f(x)\|_*\,\Delta x_{\text{nsd}}$$ $$\Delta x_{\text{nsd}} = \operatorname{argmin}\{\nabla f(x)^T v \mid \|v\|\leq 1\}$$

I cannot understand the difference between them. How are they mathematically and geometrically different?

2 Answers


I am reading this book too, and this confused me for a long time as well. The descent direction in the gradient descent method is the negative gradient. The direction in the steepest descent method, however, is

$$\Delta x_{\text{nsd}}=\operatorname{argmin}\{\nabla f(x)^T v \mid \|v\|\leq 1\},$$

which coincides with the negative gradient direction only when the norm is Euclidean. For other choices, such as a general quadratic norm or the $\ell_1$ norm, the resulting direction is not the negative gradient.

For a quadratic norm $\|z\|_P = (z^T P z)^{1/2}$ (with $P$ positive definite), the steepest descent direction is $-P^{-1}\nabla f(x)$.
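To make the norm-dependence concrete, here is a small NumPy sketch (the gradient `g` and matrix `P` are made-up illustrative values) that computes the steepest descent step $\Delta x_{\text{sd}} = \|\nabla f(x)\|_*\,\Delta x_{\text{nsd}}$ under three different norms:

```python
import numpy as np

g = np.array([3.0, -1.0])       # example gradient ∇f(x) (made-up values)
P = np.array([[2.0, 0.0],
              [0.0, 0.5]])      # positive definite matrix for the quadratic norm

# Euclidean norm: Δx_sd = -∇f(x), so steepest descent = gradient descent
dx_euclidean = -g

# Quadratic norm ||z||_P = sqrt(z^T P z): Δx_sd = -P^{-1} ∇f(x)
dx_quadratic = -np.linalg.solve(P, g)

# l1 norm: the step moves only along the coordinate with the largest
# |gradient| component, i.e. Δx_sd = -g_i e_i with i = argmax_i |g_i|
i = np.argmax(np.abs(g))
dx_l1 = np.zeros_like(g)
dx_l1[i] = -g[i]

print(dx_euclidean)   # [-3.  1.]
print(dx_quadratic)   # [-1.5  2. ]
print(dx_l1)          # [-3.  0.]
```

All three are descent directions, but only the Euclidean one equals the negative gradient; the quadratic norm gives a preconditioned step, and the $\ell_1$ norm gives a coordinate-descent-style step.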


I happen to also be looking at the same part of Boyd's Convex Optimization book and thought I would give my 2 cents on this matter:

  1. Method of Gradient Descent: descends only in the negative gradient direction.

  2. Method of Steepest Descent: descends in the direction that minimizes the directional derivative over the unit ball of a chosen norm.

I believe the critical difference here is the directional derivative ($\nabla f(x)^{T}v$, the derivative of $f$ at $x$ in the direction $v$). The method of Steepest Descent can be viewed as (from page 476 of Boyd's Convex Optimization book):

i.e., as the direction in the unit ball of $\| \cdot \|$ that extends farthest in the direction $−\nabla f(x)$.

Here the norm $\| \cdot \|$ constrains the directions you can move in, and this constraint changes the behavior of the descent. As the other answer notes, if you choose $\| \cdot \|_{2}$, the two methods are identical. For some problems an alternative norm is preferable: for example, if you are performing descent on a sparse problem and penalizing the $\|\cdot \|_{1}$ norm (as a convex relaxation of the pseudo-norm $\|\cdot \|_{0}$), then you would probably want to use steepest descent with an $\ell_{1}$ norm.
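To illustrate how the $\ell_1$ choice changes the trajectory, here is a minimal sketch (the objective, starting point, and step size are made-up): steepest descent under the $\ell_1$ norm updates one coordinate per iteration, behaving like coordinate descent on the largest gradient component.

```python
import numpy as np

def grad(x):
    # Gradient of the made-up quadratic f(x) = 2*x0^2 + 0.5*x1^2
    return np.array([4.0 * x[0], 1.0 * x[1]])

x = np.array([1.0, 1.0])   # arbitrary starting point
t = 0.2                    # fixed step size, for illustration only

for _ in range(20):
    g = grad(x)
    # l1-norm steepest descent: move only along the coordinate
    # with the largest |gradient| component
    i = np.argmax(np.abs(g))
    step = np.zeros_like(x)
    step[i] = -g[i]
    x = x + t * step

print(x)  # approaches the minimizer [0, 0]
```

Each iteration touches a single coordinate, which is why this variant pairs naturally with sparsity-promoting problems; with the Euclidean norm, the same loop would reduce to plain gradient descent and update every coordinate at once.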