Connection between gradient descent and Newton's method


Given a function $f:\mathbb{R}^d\to \mathbb{R}$, suppose we want to find the minimum of $f$. The gradient descent method seeks a minimizer of $f$ by iterating the following formula:

$x_{n+1} = x_n - \eta \nabla f(x_n) \tag*{}$

where $\eta$ is a constant called the learning rate.
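For concreteness, here is a minimal sketch of this iteration in Python; the test function, starting point, and step count are illustrative choices, not part of the question:

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, n_steps=100):
    """Iterate x_{n+1} = x_n - eta * grad_f(x_n) with a fixed learning rate."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - eta * grad_f(x)
    return x

# Example: f(x) = ||x||^2 / 2 has gradient x and its minimum at the origin.
x_min = gradient_descent(lambda x: x, x0=[1.0, -2.0])
```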

Recently, I was introduced to the method of finding the optimal learning rate in the gradient descent, which is shown as follows.

For simplicity, let $g =\nabla f(x_n)$. Applying the second-order Taylor approximation of $f$ centered at $x_n$ and evaluated at $x_n - \eta g$, we have the following:

$f(x_{n+1}) = f(x_n -\eta g) \approx f(x_n) +(-\eta g)^T g + \frac{1}{2}(-\eta g)^T H (-\eta g)$

$ =f(x_n) -\eta g^T g + \frac{1}{2}\eta^2 g^T H g $

where $H = \left(\frac{\partial^2 f}{\partial x_i \partial x_j}(x_n)\right)$ is the Hessian matrix of $f$ at $x_n$.

We want to minimize $f(x_{n+1})-f(x_n)$ so that the gradient descent step is optimal. This means we want to minimize the quadratic function $q(\eta) = -\eta g^T g + \frac{1}{2}\eta^2 g^T H g$. Setting $q'(\eta) = -g^T g + \eta\, g^T H g = 0$ (and assuming $g^T H g > 0$), we obtain the optimal learning rate $\eta = \frac{g^T g}{g^T H g}$.
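For a quadratic $f$, the second-order model is exact, so this $\eta$ performs an exact line search along $-g$. A quick numerical check of that claim (the matrix, the vector $b$, and the starting point are made up for illustration):

```python
import numpy as np

# Quadratic test function f(x) = 0.5 x^T A x - b^T x, with Hessian H = A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, -1.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

x = np.array([2.0, 2.0])
g = A @ x - b                      # gradient of f at x
eta_opt = (g @ g) / (g @ A @ g)    # candidate optimal learning rate

# Scan f along the ray x - eta*g; the minimum should land at eta_opt.
etas = np.linspace(0.0, 2.0 * eta_opt, 1001)
values = [f(x - t * g) for t in etas]
```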

After learning this, I was introduced to Newton's method, which optimizes $f$ by iterating the following:

$x_{n+1} = x_n - H^{-1} g \tag*{}$
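Here is a sketch of one Newton update, using a linear solve rather than an explicit matrix inverse; the quadratic test function below is an illustrative choice:

```python
import numpy as np

def newton_step(grad_f, hess_f, x):
    """One Newton update: x - H(x)^{-1} grad_f(x), computed via a solve."""
    g = grad_f(x)
    H = hess_f(x)
    return x - np.linalg.solve(H, g)

# On a quadratic f(x) = 0.5 x^T A x - b^T x, a single Newton step from any
# starting point lands exactly on the minimizer x* = A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_star = newton_step(lambda x: A @ x - b, lambda x: A, np.array([10.0, -7.0]))
```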

When I saw this formula, I was like, hmm, wait a minute, it looks like it has something to do with the optimal learning rate.

Indeed, if we substitute $\eta= \frac{g^T g}{g^T H g}$, by shady mathematics, we have the following:

$x_{n+1}=x_n - \frac{g^T g}{g^T H g} g = x_n - \frac{g^T g}{g^T H} = x_n - \frac{g}{H} = x_n - H^{-1}g \tag*{}$

(I know this is very informal, but bear with me)

So the question is as follows:

(1) Is gradient descent with the optimal learning rate equivalent to Newton's method?

(2) If the answer to (1) is yes, is there a way to make the above informal argument more rigorous?


Best Answer

In one dimension, your shady mathematics is legitimate, and the two are the same. In higher dimensions, they are indeed different.
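This is easy to confirm numerically: in one dimension $\eta = g^2/(gHg) = 1/H$, so the two updates agree, while in two dimensions they already differ whenever $g$ is not an eigenvector of $H$. The numbers below are arbitrary illustrations:

```python
import numpy as np

# 1D: the optimally stepped gradient update equals the Newton update.
H1, g1 = 4.0, 3.0
eta1 = (g1 * g1) / (g1 * H1 * g1)   # simplifies to 1/H1

# 2D: an ill-conditioned Hessian and a gradient that is not an eigenvector.
H = np.array([[10.0, 0.0], [0.0, 1.0]])
g = np.array([1.0, 1.0])

eta = (g @ g) / (g @ H @ g)
step_gd = -eta * g                    # constrained to the gradient direction
step_newton = -np.linalg.solve(H, g)  # free to leave the gradient direction
```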

The connection between the two is that both choose $x_{n+1}=x_n+\delta$ so that the second-order Taylor model

$$f(x_n+\delta)\approx f(x_n)+\delta^T\nabla f(x_n)+\frac{1}{2}\delta^T Hf(x_n)\,\delta$$

is stationary along the step: differentiating $t\mapsto f(x_n)+t\,\delta^T\nabla f(x_n)+\frac{t^2}{2}\delta^T Hf(x_n)\,\delta$ with respect to $t$ and setting the derivative to zero at $t=1$ gives

$$\delta^T\nabla f(x_n)+\delta^THf(x_n)\delta=0,$$

which holds for both methods.

However, in the case of gradient descent, we impose the additional constraint that $\delta$ point along the negative gradient, i.e. $\delta=-\eta\nabla f(x_n)$ for some $\eta>0$; only the step length is optimized. For Newton's method, instead of requiring the perturbation $\delta$ to follow the gradient, we require the stronger condition that

$$\varepsilon^T\nabla f(x_n)+\varepsilon^THf(x_n)\delta=0$$

for any vector $\varepsilon$, which forces $\nabla f(x_n)+Hf(x_n)\,\delta=0$, i.e. the Newton step $\delta=-Hf(x_n)^{-1}\nabla f(x_n)$.
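Both conditions can be checked numerically: with the optimal learning rate the residual $\nabla f + H\delta$ is merely orthogonal to the step $\delta$, while the Newton step makes the residual vanish entirely. The Hessian and gradient below are arbitrary example values:

```python
import numpy as np

H = np.array([[3.0, 1.0], [1.0, 2.0]])   # Hessian at the current point
g = np.array([1.0, -2.0])                # gradient at the current point

# Gradient descent with the optimal learning rate:
# the residual g + H*delta is orthogonal to delta, but nonzero in general.
eta = (g @ g) / (g @ H @ g)
d_gd = -eta * g

# Newton's method: the residual g + H*delta is identically zero,
# hence orthogonal to every test vector epsilon.
d_nt = -np.linalg.solve(H, g)
```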