Does the method of steepest descent always move in an orthogonal direction between iterations?


I understand everything about the method, I think, except for the result (or requirement) that successive steps are orthogonal to each other.

So, with the formula for this algorithm as:

$$\mathbf{x}_{n+1}=\mathbf{x}_n-\gamma_n \nabla F(\mathbf{x}_n)$$

I understand the role of $\gamma_n$ (a factor that controls the step size, though does it also affect the direction?), and the idea of the gradient (whose negative gives the direction of steepest descent of the objective function), but not why the zig-zag pattern is so common, or whether it necessarily results from the algorithm.

I've seen it stated that $\gamma_n$ should be chosen so that $\nabla F(\mathbf{x}_{n+1})$ and $\nabla F(\mathbf{x}_n)$ are orthogonal, but I am unsure why.

Thanks.


Consider the Taylor series, up to second order, of
\begin{align} F\big(\boldsymbol x_{n+1} \big) &= F\big(\boldsymbol x_{n} - \gamma_n \nabla F(\boldsymbol x_{n} ) \big) \\ &= F(\boldsymbol x_{n} ) - \gamma_n \nabla F(\boldsymbol x_{n} )^T \nabla F(\boldsymbol x_{n} ) + \frac12 \gamma_n ^2 \nabla F(\boldsymbol x_{n} )^T \nabla^2 F(\boldsymbol x_{n} ) \nabla F(\boldsymbol x_{n} ) + \mathcal O \Big( \Vert \nabla F(\boldsymbol x_{n} ) \Vert^3 \Big). \end{align}

For fixed $\boldsymbol x_{n}$ this is a univariate, scalar function of $\gamma_n$. Neglecting the $\mathcal O \Big( \Vert \nabla F(\boldsymbol x_{n} ) \Vert^3 \Big)$ term leaves a quadratic function of $\gamma_n$,
$$ h(\gamma_n) := F(\boldsymbol x_{n} ) - \gamma_n \nabla F(\boldsymbol x_{n} )^T \nabla F(\boldsymbol x_{n} ) + \frac12 \gamma_n ^2 \nabla F(\boldsymbol x_{n} )^T \nabla^2 F(\boldsymbol x_{n} ) \nabla F(\boldsymbol x_{n} ), $$
with a unique minimum attained at the $\gamma_n^\star$ where $h'(\gamma_n^\star) = 0$ (recall that we perform gradient descent to minimize $F$; uniqueness of the minimum requires $\nabla F(\boldsymbol x_{n})^T \nabla^2 F(\boldsymbol x_{n}) \nabla F(\boldsymbol x_{n}) > 0$, which holds e.g. when the Hessian is positive definite). Introducing the shortcuts $\boldsymbol g_n := \nabla F(\boldsymbol x_{n})$ and $H_n := \nabla^2 F(\boldsymbol x_{n})$, one obtains
$$\gamma_n^\star = \frac{\boldsymbol g_n^T \boldsymbol g_n}{\boldsymbol g_n^T H_n \boldsymbol g_n}.$$
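To make this concrete, here is a small numerical sketch (Python with NumPy; the matrix $A$, vector $b$, and starting point are made-up illustration values) for a quadratic objective $F(\boldsymbol x) = \tfrac12 \boldsymbol x^T A \boldsymbol x - \boldsymbol b^T \boldsymbol x$, whose gradient is $A \boldsymbol x - \boldsymbol b$ and whose Hessian is the constant matrix $A$, so the quadratic model above is exact:

```python
import numpy as np

# Illustrative quadratic objective F(x) = 0.5 x^T A x - b^T x:
# gradient g(x) = A x - b, Hessian H = A (constant).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])        # symmetric positive definite
b = np.array([1.0, -1.0])

x = np.array([4.0, 4.0])          # arbitrary starting point x_n
g = A @ x - b                     # g_n = grad F(x_n)
gamma = (g @ g) / (g @ (A @ g))   # optimal step gamma_n^* = g^T g / (g^T H g)
x_next = x - gamma * g            # steepest-descent update x_{n+1}
```

For this quadratic the neglected $\mathcal O(\cdot)$ term vanishes, so $\gamma_n^\star$ is the exact minimizer of $F$ along the search ray.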

If you now go ahead and use this (up to second order) optimal $\gamma_n^\star$ in the update formula for $\boldsymbol x_{n+1}$,
$$\boldsymbol x_{n+1} = \boldsymbol x_{n} - \frac{\boldsymbol g_n^T \boldsymbol g_n}{\boldsymbol g_n^T H_n \boldsymbol g_n} \boldsymbol g_n,$$
you can again Taylor expand (this time only up to first order):
$$ \nabla F\big( \boldsymbol x_{n+1} \big) = \nabla F \Bigg( \boldsymbol x_{n} - \frac{\boldsymbol g_n^T \boldsymbol g_n}{\boldsymbol g_n^T H_n \boldsymbol g_n} \boldsymbol g_n \Bigg) \approx \nabla F\big( \boldsymbol x_n \big) - \frac{\boldsymbol g_n^T \boldsymbol g_n}{\boldsymbol g_n^T H_n \boldsymbol g_n} \nabla^2 F\big( \boldsymbol x_n \big) \boldsymbol g_n. $$
Then take a look at the scalar product:
\begin{align} \nabla F\big( \boldsymbol x_{n} \big)^T \nabla F\big( \boldsymbol x_{n+1} \big) &\approx \nabla F\big( \boldsymbol x_{n} \big)^T\Bigg[\nabla F\big( \boldsymbol x_n \big) - \frac{\boldsymbol g_n^T \boldsymbol g_n}{\boldsymbol g_n^T H_n \boldsymbol g_n} \nabla^2 F\big( \boldsymbol x_n \big) \boldsymbol g_n \Bigg] \\ &= \boldsymbol g_n^T \boldsymbol g_n - \frac{\boldsymbol g_n^T \boldsymbol g_n}{\boldsymbol g_n^T H_n \boldsymbol g_n} \boldsymbol g_n^T H_n \boldsymbol g_n = 0. \end{align}

To summarize: for the step length $\gamma_n^\star$ that is optimal for the quadratic approximation, the gradients of two successive steps are orthogonal (up to first order). In fact, with an exact line search (choosing $\gamma_n$ to minimize $\gamma \mapsto F(\boldsymbol x_n - \gamma \boldsymbol g_n)$ exactly), the orthogonality is exact: at the minimizing $\gamma_n$ the directional derivative $\frac{\mathrm d}{\mathrm d\gamma} F(\boldsymbol x_n - \gamma \boldsymbol g_n) = -\nabla F(\boldsymbol x_{n+1})^T \boldsymbol g_n$ vanishes. This is precisely the source of the zig-zag pattern: each step moves perpendicular to the previous one.
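As a numerical sanity check (a Python/NumPy sketch; the ill-conditioned diagonal Hessian is a made-up example, chosen deliberately to make the zig-zag pronounced), running steepest descent with this step length on a quadratic produces successive gradients that are orthogonal to machine precision:

```python
import numpy as np

# Illustrative quadratic F(x) = 0.5 x^T A x with an ill-conditioned
# Hessian A; exact line search then produces the classic zig-zag.
A = np.diag([10.0, 1.0])
x = np.array([1.0, 1.0])

grads = []
for _ in range(6):
    g = A @ x                        # g_n = grad F(x_n)
    grads.append(g)
    gamma = (g @ g) / (g @ (A @ g))  # optimal step gamma_n^*
    x = x - gamma * g                # steepest-descent update

# Dot products of consecutive gradients: all (numerically) zero.
dots = [abs(grads[i] @ grads[i + 1]) for i in range(len(grads) - 1)]
```

Each search direction is perpendicular to the previous one, while the iterates still converge toward the minimizer at the origin; the worse the conditioning of $A$, the more pronounced the zig-zag.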