Newton's Method vs Gradient Descent?


Currently I am studying logistic regression.

I read online that Gradient Descent is a 1st-order optimisation algorithm, and that Newton's Method is a 2nd-order optimisation algorithm.

Does that mean that Gradient Descent cannot be used for multivariate optimisation, and that Newton's Method cannot be used for univariate optimisation? Or can Newton's Method be carried out using a 1st-order Taylor polynomial and still be different from Gradient Descent?

These sites are causing me to question:

3 Answers

Answer 1 (score 9)

As stated in the comments, gradient descent and Newton's method are both optimization methods, regardless of whether the problem is univariate or multivariate. Gradient descent uses only the first derivative, which sometimes makes it less efficient in multidimensional problems. Newton's method uses the curvature of the function (the second derivative), which generally leads to a solution faster when the second derivative is easy to compute, although in higher dimensions it can be attracted to saddle points. So both can be used for multivariate and univariate optimization, but their performance will generally not be similar.
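
To make this concrete, here is a minimal Python sketch of both methods on a univariate convex function. The function $f(x)=\log(1+e^{x}) - 0.3x$, the starting point, and the step size are arbitrary choices for illustration (its minimizer satisfies $\operatorname{sigmoid}(x)=0.3$, i.e. $x=\log(3/7)\approx -0.847$):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# f(x) = log(1 + e^x) - 0.3*x  (convex; arbitrary illustrative choice)
def fprime(x):        # f'(x) = sigmoid(x) - 0.3
    return sigmoid(x) - 0.3

def fsecond(x):       # f''(x) = sigmoid(x) * (1 - sigmoid(x)) > 0
    return sigmoid(x) * (1.0 - sigmoid(x))

x_gd = x_nt = 0.0
for _ in range(20):
    x_gd -= 1.0 * fprime(x_gd)             # 1st-order (gradient) step, eta = 1.0
    x_nt -= fprime(x_nt) / fsecond(x_nt)   # 2nd-order (Newton) step

# Both approach log(3/7) ~ -0.847; Newton is essentially there after ~3 steps.
print(x_gd, x_nt)
```

Both iterations reach the same minimizer, but Newton's faster convergence shows even in one dimension: the distinction between the methods is the order of derivative information used, not the number of variables.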

Answer 2 (score 0)

Let us assume that we are trying to minimize $f(x)$ w.r.t. $x$, with $f$ differentiable.

Let $\nabla f(x) = f'(x)$ denote the first derivative of $f(x)$ w.r.t. $x$, and $\nabla^2 f(x) = (f')'(x) = f''(x)$ the second derivative of $f(x)$ w.r.t. $x$.

To minimize $f(x)$ w.r.t. $x$, we can use either the Gradient Descent method or Newton's Method; both are iterative methods.

$\textbf{Gradient Descent method}$ has the form:

$$ x^{t+1} = x^{t} - \eta \, \nabla f(x^{t})$$

where $\eta$ is a step-size parameter. Here we use only the first derivative; thus, it is called a 1st-order optimization method.

$\textbf{Newton's Method}$ has the form:

$$ x^{t+1} = x^{t} - \eta \left(\frac{\nabla f(x^{t})}{\nabla^2 f(x^{t})}\right) $$

Here we use both the first derivative and the second derivative; thus, this method is called a 2nd-order optimization method. (With $\eta = 1$ this is the classical Newton step; $\eta < 1$ gives a damped variant.)
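
As a sketch of these two update rules in code (the quadratic test function $f(x) = (x-3)^2$ and the step size $\eta = 0.1$ are arbitrary choices of mine for illustration):

```python
def gradient_descent_step(x, grad, eta):
    """1st-order update: x_{t+1} = x_t - eta * f'(x_t)."""
    return x - eta * grad(x)

def newton_step(x, grad, hess, eta=1.0):
    """2nd-order update: x_{t+1} = x_t - eta * f'(x_t) / f''(x_t)."""
    return x - eta * grad(x) / hess(x)

grad = lambda x: 2.0 * (x - 3.0)   # f'(x) for f(x) = (x - 3)^2
hess = lambda x: 2.0               # f''(x), constant for a quadratic

x_gd, x_nt = 0.0, 0.0
for t in range(10):
    x_gd = gradient_descent_step(x_gd, grad, eta=0.1)
    x_nt = newton_step(x_nt, grad, hess)

print(x_gd, x_nt)  # gradient descent creeps toward 3; Newton hits 3 on step one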

Answer 3 (score 0)
  • Newton's method aims to find the roots of a function $g$, i.e., the $x$ values satisfying $g(x)=0$. The iterative update step is $x_{n+1} = x_n - J_g (x_n)^{-1}g(x_n)$, where $J_g$ is the Jacobian matrix of $g$.

  • Gradient descent aims at finding a (local) minimum of a function $f$. Assuming $f$ to be continuously differentiable, a necessary condition for a minimum of $f$ at $x$ is $\nabla f(x)=0$ (note that it is not sufficient, since the critical point can also be a maximum or a saddle point), and the iterative update step is $x_{n+1} = x_n - \eta\nabla f(x_n)$, where $\eta$ is the learning rate.

  • Hence, the minimization problem $\min_x f(x)$ can be recast as the root-finding problem $\nabla f(x) = 0$, assuming a local minimum exists in some neighborhood: let $g = \nabla f$ and solve with Newton's method, with iterative update step $x_{n+1} = x_n - H_f (x_n)^{-1}\nabla f(x_n)$, where $H_f = J_{\nabla f}$ is the Hessian matrix of $f$. Hence it becomes a second-order method.

  • Newton's method often converges much faster, but each iteration is computationally more expensive (it requires computing and inverting the Hessian).

  • The next couple of animations demonstrate the convergence to the (global) minimum $(0,0)$ of the (convex) function $f(x,y)=x^2+y^2$ with gradient descent (with learning rate $\eta = 0.05$ it takes around $50$ iterations to converge, as shown below) and with Newton's method (which converges in a single iteration), starting from the initial value $(3,3)$.

  • $f(x,y) = x^2 + y^2$, $\nabla f = \begin{bmatrix} 2x \\ 2y \end{bmatrix}$ and $H_f = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$

  • Gradient descent update:

$\quad \quad \begin{bmatrix} x_{n+1} \\ y_{n+1} \end{bmatrix} = \begin{bmatrix} x_{n} \\ y_{n} \end{bmatrix} - \eta \begin{bmatrix} 2x_{n} \\ 2y_{n} \end{bmatrix} = (1-2\eta)\begin{bmatrix} x_{n} \\ y_{n} \end{bmatrix}$

  • As can be seen from the above, in this case gradient descent converges to the minimum in a single step only when $\eta=0.5$; for other step sizes the error only shrinks by the factor $|1-2\eta|$ per iteration.

  • Newton's method update:

$\quad \quad \begin{bmatrix} x_{n+1} \\ y_{n+1} \end{bmatrix} = \begin{bmatrix} x_{n} \\ y_{n} \end{bmatrix} - \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}^{-1}\begin{bmatrix} 2x_{n} \\ 2y_{n} \end{bmatrix}$

$\quad \quad = \begin{bmatrix} x_{n} \\ y_{n} \end{bmatrix} - \begin{bmatrix} \frac{1}{2} & 0 \\ 0 & \frac{1}{2} \end{bmatrix}\begin{bmatrix} 2x_{n} \\ 2y_{n} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$
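
A quick numerical check of the worked example above (a sketch; the stopping tolerance $10^{-2}$ is an arbitrary choice of mine):

```python
import numpy as np

def grad(v):                     # gradient of f(x, y) = x^2 + y^2
    return 2.0 * v

H = np.array([[2.0, 0.0],        # Hessian of f (constant)
              [0.0, 2.0]])

# Gradient descent from (3, 3): each step scales the iterate by (1 - 2*eta)
eta = 0.05
v = np.array([3.0, 3.0])
steps = 0
while np.linalg.norm(v) > 1e-2:  # arbitrary stopping tolerance
    v = v - eta * grad(v)
    steps += 1
print(steps)                     # 58 steps here, consistent with "around 50"

# Newton's method from (3, 3): one step lands exactly on the minimum
v = np.array([3.0, 3.0])
v = v - np.linalg.solve(H, grad(v))
print(v)                         # [0. 0.]
```

The exact one-step convergence is special to quadratics, where the second-order model Newton's method builds is exact; on a general $f$ it would typically still need a few iterations.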

[Animations omitted: gradient descent trajectories on $f(x,y)=x^2+y^2$ with $\eta=0.05$, $\eta=0.25$, and $\eta=0.5$, followed by the Newton's method iteration.]