Drawbacks of gradient descent


In the method of gradient descent, each iteration reduces the problem to a single-variable optimization problem in order to find the step size $\alpha$ (called the learning rate?), using the update rule:

$$\mathbf{x}_{t+1} = \mathbf{x}_t-\alpha\nabla f(\mathbf{x}_t)$$

Drawbacks:

  1. computational: we must evaluate the gradient, or at least estimate it using finite differences.

  2. order of convergence: with exact line search we have $$\nabla f (\mathbf{x}_{t+1})^T\nabla f(\mathbf{x}_t) = 0,$$ so successive search directions are orthogonal, which produces a "zig-zag" pattern and increases the number of iterations.
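The orthogonality of successive gradients under exact line search, and the resulting zig-zag, can be checked numerically. A minimal sketch on a two-dimensional quadratic (the matrix `A`, the starting point, and all names below are illustrative assumptions, not from the question; for a quadratic the exact line-search step has a closed form):

```python
import numpy as np

# Illustrative anisotropic quadratic f(x) = 0.5 * x^T A x.
A = np.array([[10.0, 0.0],
              [0.0, 1.0]])

def grad(x):
    return A @ x

def exact_step(g):
    # Exact line-search step size for a quadratic: argmin_a f(x - a g).
    return (g @ g) / (g @ A @ g)

x = np.array([1.0, 10.0])
path = [x.copy()]
for _ in range(10):
    g = grad(x)
    x = x - exact_step(g) * g
    path.append(x.copy())

# Successive gradients are orthogonal, so the iterates zig-zag.
print(abs(grad(path[0]) @ grad(path[1])))  # numerically zero
```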

When does gradient descent make a $90$-degree turn? On the one hand we see that

$$\nabla f (\mathbf{x}_{t+1})^T\nabla f(\mathbf{x}_t) = 0$$

On the other hand, we know that the gradient is perpendicular to the level sets. How do these two observations fit together?

I see that in some cases we start with a random starting point and a fixed learning rate $\alpha$, while in other cases we start from a point and find $\alpha$ using line search methods. Are these the same optimization method?


3 Answers


Technically they belong to the same optimization method: namely that of steepest descent or gradient descent.

The strategy in line search is to evaluate $f(\mathbf{x}-\alpha\nabla_x f(\mathbf{x}))$ for several values of $\alpha$ and choose the one that yields the smallest objective function value, while the more popular approach is simply to set $\alpha$ to a small constant.

Line search and using a (constant) learning rate are thus two different strategies but belong to the same optimization method.
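As a toy illustration of the two strategies (the one-dimensional objective, the constant rate, and the candidate $\alpha$ values below are assumptions chosen for the sketch):

```python
# Toy objective with minimum at x = 3.
def f(x):
    return (x - 3.0) ** 2

def grad(x):
    return 2.0 * (x - 3.0)

x = 0.0

# Strategy 1: one step with a small constant learning rate.
x_const = x - 0.1 * grad(x)

# Strategy 2: crude line search over a few candidate step sizes,
# keeping the one with the smallest objective value.
candidates = [0.01, 0.1, 0.5, 1.0]
best_alpha = min(candidates, key=lambda a: f(x - a * grad(x)))
x_ls = x - best_alpha * grad(x)

print(f(x_const), f(x_ls))  # the line-search step reaches a lower value here
```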


In general these are considered different algorithms. For gradient-based algorithms, the differences typically arise in the conditions under which line searches are terminated.

For example, if an algorithm performs an exact line search on a smooth objective function $f$, then at each iteration, the following sub-problem is solved

$$\begin{equation}\label{1} \min_{\alpha} \quad \phi(\alpha) := f(x_k - \alpha g_k) \tag{1}.\end{equation}$$

Here $g_k = \nabla f(x_k)$. If $\alpha^*$ is the solution of \ref{1}, then the next iterate is $x_{k+1} = x_k - \alpha^*g_k$, and it follows that $\phi^{\prime}(\alpha^*) = -g_{k+1}^T g_k = 0$, i.e. $g_{k+1}^T g_k = 0$. This is not true if the sub-problem is not solved exactly.

Since solving the sub-problem exactly is, in general, not easy, it is solved approximately using some variant of bracketing or interpolation schemes. More practical approaches use line search termination conditions such as the Armijo, Wolfe, or Goldstein conditions, which accept a step whenever sufficient decrease is guaranteed. These conditions are designed primarily so that the line search terminates after finitely many trial steps.
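For concreteness, a backtracking line search with the Armijo sufficient-decrease condition might be sketched as follows (the function name `armijo_step` and the constants `c` and `rho` are illustrative, with conventional textbook values; the test objective is an assumption):

```python
import numpy as np

def armijo_step(f, grad_f, x, alpha0=1.0, c=1e-4, rho=0.5, max_iter=50):
    """Backtracking line search: shrink alpha until the Armijo
    sufficient-decrease condition holds."""
    g = grad_f(x)
    alpha = alpha0
    for _ in range(max_iter):
        # Accept alpha once f decreases by at least c * alpha * ||g||^2.
        if f(x - alpha * g) <= f(x) - c * alpha * (g @ g):
            return alpha
        alpha *= rho  # otherwise shrink the step and retry
    return alpha

# Toy check on f(x) = 0.5 ||x||^2, where g = x.
f = lambda x: 0.5 * (x @ x)
grad_f = lambda x: x
x = np.array([2.0, -1.0])
alpha = armijo_step(f, grad_f, x)
print(alpha)
```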


I recommend reading Chapter 3 of the review paper "Optimization Methods for Large-Scale Machine Learning" (published in SIAM Review), which contains a thorough overview of the classical gradient descent method (although the emphasis is on stochastic gradient descent and its various variants/refinements). A more basic introduction to the analysis of (stochastic) gradient descent can be found in Chapter 14 of the book "Understanding Machine Learning: From Theory to Algorithms" by Shai Shalev-Shwartz and Shai Ben-David.