Difference between Gradient Descent method and Steepest Descent


What is the difference between Gradient Descent method and Steepest Descent methods?

In this book, they have come under different sections:

http://stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf

According to page 480, Gradient Descent is:

$$\Delta x=-\nabla f(x)$$

While page 490, for Steepest descent says:

$$\Delta x_{\text{sd}}=\|\nabla f(x)\|_*\,\Delta x_{\text{nsd}}, \qquad \Delta x_{\text{nsd}} = \operatorname{argmin}\{\nabla f(x)^T v \mid \|v\|\leq 1\}$$

I cannot understand their difference. How they are mathematically and geometrically different?

There are 6 answers below.

Answer 1

I am reading this book too, and this was a problem for me for a long time. The direction used by the gradient descent method is the negative gradient. The direction used by the steepest descent method, however, is

$\Delta x_{\text{nsd}}=\operatorname{argmin}\{\nabla f(x)^T v \mid \|v\|\leq 1\},$

which is the negative gradient only if the norm is Euclidean. If the norm is a general quadratic norm or the $\ell_1$ norm, the resulting direction is not the negative gradient.

For the quadratic norm $\|z\|_P=(z^T P z)^{1/2}$ with $P$ symmetric positive definite, the direction is $-P^{-1}\nabla f(x)$.
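To make this concrete, here is a small numerical sketch (the positive definite matrix $P$ and the gradient vector are arbitrary choices of mine, not examples from the book) comparing the two directions:

```python
import numpy as np

# Steepest descent direction under a quadratic norm ||z||_P = sqrt(z^T P z).
# P and g below are arbitrary illustrative choices.
P = np.array([[10.0, 2.0],
              [2.0, 1.0]])   # symmetric positive definite
g = np.array([1.0, -2.0])    # stand-in for the gradient ∇f(x)

d_gd = -g                        # gradient descent direction
d_sd = -np.linalg.solve(P, g)    # steepest descent direction: -P^{-1} ∇f(x)

# Both have negative inner product with the gradient (both are descent
# directions), but they are generally not parallel unless P is a multiple
# of the identity.
cos = (d_gd @ d_sd) / (np.linalg.norm(d_gd) * np.linalg.norm(d_sd))
```

The solve with $P$ is what distinguishes the two methods: with $P = I$ (the Euclidean norm), `d_sd` collapses back to `d_gd`.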

Answer 2

Let $\mathbb{M}$ be a linear space, and let $f:\mathbb{M} \to \mathbb{R}$ be a scalar function.

  • The gradient $\nabla_\mathbf{m} f$ of $f$ at a given point $\mathbf{m} \in \mathbb{M}$ packages all the directional derivatives of $f$ at that point, and it is defined irrespective of any possible norm over $\mathbb{M}$. The gradient lives in the dual space, i.e. $\nabla_\mathbf{m} f \in \mathbb{M}^*$.

  • To define the direction of steepest descent (or ascent) at a point $\mathbf{m} \in \mathbb{M}$ we must provide a norm over $\mathbb{M}$; for example, we might use the $l_1$ or $l_2$ norm. Herein lies the key difference. Given a norm on $\mathbb{M}$, one may consider an infinitesimally small ball around a point $\mathbf{m} \in \mathbb{M}$ and pick the point $\mathbf{m}_{\rm min/max}$ on the boundary of the ball where $f$ attains its smallest/largest value. The direction of steepest descent (or ascent) is defined as the displacement $\delta \mathbf{m}_{\rm min/max} \in \mathbb{M}$ "pointing towards $\mathbf{m}_{\rm min/max}$". It is related to the gradient via the basic duality relation between $\mathbb{M}$ and $\mathbb{M}^*$.

For an $l_2$ norm with metric $\mathbf{C}$ this relation is given by $\delta \mathbf{m}_{\rm min/max} = \mathbf{C} \nabla_\mathbf{m} f$. By inspecting the derivation of Hessian-based optimisation algorithms such as Newton's method, you will see that $\mathbf{C}^{-1}$ is the Hessian $\nabla_\mathbf{m}^2 f$.
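As a small sanity check of that last point (the toy quadratic below is my own choice, not an example from this answer): taking the metric $\mathbf{C}$ to be the inverse Hessian makes the steepest descent step the Newton step, which for a quadratic lands exactly on the minimizer.

```python
import numpy as np

# Toy quadratic f(x) = 1/2 x^T A x - b^T x, with gradient A x - b and
# constant Hessian A. All values here are arbitrary illustrative choices.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # symmetric positive definite Hessian
b = np.array([1.0, 1.0])

x = np.array([5.0, -3.0])    # arbitrary starting point
grad = A @ x - b

# Steepest descent under the Hessian metric (C = A^{-1}) is the Newton step:
step = -np.linalg.solve(A, grad)   # -A^{-1} ∇f(x)
x_new = x + step

# For a quadratic, one such step reaches the minimizer A^{-1} b exactly.
x_star = np.linalg.solve(A, b)
```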

Summary

The gradient collects the directional derivatives of a function. The direction of steepest descent (or ascent) is the direction, amongst all nearby directions, that lowers or raises the value of $f$ the most.

Answer 3

I happen to be looking at the same part of Boyd's Convex Optimization book and thought I would give my 2 cents on this matter:

  1. Method of Gradient Descent: always descends in the negative gradient direction.

  2. Method of Steepest Gradient Descent: descends in the direction of the most negative directional derivative, subject to a unit-norm constraint.

I believe the critical difference here is the directional derivative ($\nabla f(x)^{T}v$, the directional derivative of $f$ at $x$ in the direction $v$). The steepest descent direction can be viewed as (from page 476 of Boyd's Convex Optimization book):

i.e., as the direction in the unit ball of $\| \cdot \|$ that extends farthest in the direction $−\nabla f(x)$.

The norm $\| \cdot \|$ constrains the directions you can move in, and this constraint changes the behavior of the descent. As others have said, if you choose $\| \cdot \|_{2}$, the two methods are identical. There are cases where an alternative norm is better suited to a specific problem. For example, if you perform descent on a sparse problem and penalize the $\|\cdot \|_{1}$ norm (as a convex relaxation of the pseudo-norm $\|\cdot \|_{0}$), then you would probably want to use steepest descent with the $\ell_{1}$ norm.
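As a rough illustration of the $\ell_1$ case (the gradient vector below is an arbitrary choice of mine, not from the book): the normalized $\ell_1$ steepest descent direction is a signed coordinate vector along the component of the gradient with the largest magnitude, so each step updates a single coordinate.

```python
import numpy as np

# Normalized steepest descent direction for the l1 norm:
# minimize ∇f(x)^T v subject to ||v||_1 <= 1.
g = np.array([0.3, -2.0, 1.1])      # stand-in for ∇f(x)

i = np.argmax(np.abs(g))            # coordinate with the largest |partial derivative|
v = np.zeros_like(g)
v[i] = -np.sign(g[i])               # Δx_nsd: move against that component

# Δx_sd scales Δx_nsd by the dual norm; the dual of l1 is l-infinity:
dx_sd = np.max(np.abs(g)) * v
```

Note that $\nabla f(x)^T \Delta x_{\text{nsd}} = -\|\nabla f(x)\|_\infty$, which is why the update concentrates on the single steepest coordinate.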

Answer 4

In steepest descent, the cost function is evaluated after each backpropagation pass. If the cost has decreased, training continues and the learning rate is doubled; if the cost has increased, the learning rate is halved and the weights are reset to their values from before the backpropagation step.

Answer 5

In gradient descent, we compute the update for the parameter vector as $\boldsymbol \theta \leftarrow \boldsymbol \theta - \eta \nabla_{\!\boldsymbol \theta\,} f(\boldsymbol \theta)$. Steepest descent is typically defined as gradient descent in which the learning rate $\eta$ is chosen such that it yields maximal gain along the negative gradient direction. The part of the algorithm that is concerned with determining $\eta$ in each step is called line search.
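A small sketch of one such step (the quadratic and all numbers are my own toy example, not from this answer): for $f(\boldsymbol\theta)=\tfrac12\boldsymbol\theta^T A\boldsymbol\theta - b^T\boldsymbol\theta$, the line search $\min_\eta f(\boldsymbol\theta - \eta\nabla f(\boldsymbol\theta))$ has the closed-form solution $\eta = (g^Tg)/(g^TAg)$ with $g = \nabla f(\boldsymbol\theta)$.

```python
import numpy as np

# Toy quadratic f(θ) = 1/2 θ^T A θ - b^T θ (arbitrary illustrative values).
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

def f(theta):
    return 0.5 * theta @ A @ theta - b @ theta

theta = np.array([2.0, 2.0])
g = A @ theta - b                  # ∇f(θ)

# Exact line search along the negative gradient: closed form for a quadratic.
eta = (g @ g) / (g @ A @ g)
theta_new = theta - eta * g

# At the line minimum, the new gradient is orthogonal to the search direction,
# which is what makes exact steepest descent take its characteristic zigzag path.
g_new = A @ theta_new - b
```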

Answer 6

We search for the new ($(k+1)$-th) iterate in the direction $-\alpha_k \nabla f(x^{(k)})$, where $\alpha_k$ is a positive step size; this is called the gradient descent method. Steepest descent is the same method in which $\alpha_k$ is chosen by an exact line search, $\alpha_k = \operatorname{argmin}_{\alpha \geq 0} f\big(x^{(k)} - \alpha \nabla f(x^{(k)})\big)$. Read chapter 8 of the book An Introduction to Optimization for more on this.