Derivation of the optimal step size of steepest descent


Let $$f(x) = \frac{1}{2}x^TAx + b^Tx$$

where $A$ is a symmetric positive definite matrix and $b \in \mathbb{R}^d$. Such a function is always twice (indeed infinitely) differentiable, with

$$ \begin{cases} \nabla f(x) = Ax + b \\ \nabla^2 f(x) = A \\ \end{cases} $$

Let us note $g_k:=\nabla f(x^{(k)})$, so that here $g_k = Ax^{(k)}+b$. In this case, the optimal step size in direction $-g_k$ can be obtained analytically by solving

$$ \begin{align} 0 &= \frac{d}{d\lambda}f(x^{(k)}-\lambda g_k)\tag{1}\label{eq1} \\ &= \langle A(x^{(k)}-\lambda g_k) + b, - g_k\rangle\tag{2}\label{eq2} \\ &= - \|g_k\|^2 + \lambda \langle Ag_k, g_k\rangle\tag{3}\label{eq3} \end{align} $$

Here are my questions:

  1. How do you get from (\ref{eq1}) to (\ref{eq2})?

If $\nabla f(x) = Ax + b$, then $\frac{d}{d\lambda}f(x^{(k)}-\lambda g_k) = A(x^{(k)}-\lambda g_k) + b$. However, where do the dot product notation $\langle\cdot, \cdot \rangle$ and the additional factor of $-g_k$ come from?

  2. How do you get from (\ref{eq2}) to (\ref{eq3})?

There are 2 answers below.


Q1

Going from (1) to (2) follows from this property of a multivariate function: $${d\over dt}f(u+t v)=\nabla f(u+t v)^T v$$ which has a simple, elementary proof using only the fact that $$f(u+t v)=f\Big(u_1+tv_1,\,u_2+tv_2,\,\cdots,\,u_n+tv_n\Big)$$

Proof

By the chain rule: $$\begin{align} {d\over dt }f(u+t v) &= \sum_{k=1}^n {\partial f\over \partial x_k}\Bigg|_{x=u+tv}\cdot {\partial (u_k+tv_k)\over \partial t} \\ &= \sum_{k=1}^n {\partial f\over \partial x_k}\Bigg|_{x=u+tv}\cdot v_k \\ &= \nabla f(u+t v)^T v \end{align}$$
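As a sanity check, here is a small NumPy sketch comparing the analytic directional derivative $\nabla f(u+tv)^T v$ against a central finite difference of $t \mapsto f(u+tv)$; the matrix $A$, vector $b$, and points $u$, $v$, $t$ are made-up example values.

```python
import numpy as np

# Made-up 2-D example: f(x) = 0.5 x^T A x + b^T x with A symmetric positive definite
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def f(x):
    return 0.5 * x @ A @ x + b @ x

def grad_f(x):
    return A @ x + b  # gradient of the quadratic

u = np.array([1.0, 2.0])
v = np.array([-0.5, 1.5])
t = 0.3

# Analytic directional derivative: grad f(u + t v)^T v
analytic = grad_f(u + t * v) @ v

# Central finite difference of t -> f(u + t v)
h = 1e-6
numeric = (f(u + (t + h) * v) - f(u + (t - h) * v)) / (2 * h)

print(analytic, numeric)  # the two values agree to floating-point accuracy
```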

Q2

Going from (2) to (3) is a direct computation, using $g_k = Ax^{(k)} + b$: $$\begin{align} \langle A(x^{(k)}-\lambda g_k) + b, - g_k\rangle &= \langle Ax^{(k)} + b - \lambda A g_k, - g_k\rangle \\ &= \langle g_k-\lambda A g_k , - g_k\rangle \\ &= -\|g_k\|^2+\lambda \langle Ag_k,g_k\rangle \end{align}$$
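The same identity can be checked numerically; in this sketch the SPD matrix, vector, point, and step size are arbitrary example values.

```python
import numpy as np

# Arbitrary SPD example checking the algebra taking (2) to (3)
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([0.5, -1.0])
x = np.array([1.0, 1.0])
lam = 0.7

g = A @ x + b  # g_k = A x^{(k)} + b

# Left side of (2) and right side of (3) should match exactly
lhs = (A @ (x - lam * g) + b) @ (-g)
rhs = -g @ g + lam * (A @ g) @ g
assert np.isclose(lhs, rhs)
```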


Sorry for the late reply; I believe this could still be useful for someone else in the future. As I understand it, you are missing the chain rule: $$f(x) = \frac{1}{2} x^T A x + b^T x,$$ $$\nabla f(x) = A x + b =: g_n,$$ $$x_{n+1} = x_n - \alpha\, g_n,$$ noting that $x_{n+1}$ becomes a function of the step size $\alpha$.

I believe you are missing the following: $$ \frac{df}{d\alpha} = \frac{df}{dx}\frac{dx}{d\alpha} = (Ax + b)^T(-g_n). $$ Substituting $x = x_n - \alpha\, g_n$ in this equation and setting it to zero to determine the optimal step, you will find the expression you are looking for: $$ \alpha = \frac{g_n^T g_n}{g_n^T A g_n} $$
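To see this formula in action, here is a minimal NumPy sketch (with a made-up SPD matrix, vector, and starting point) that computes the closed-form step $\alpha = g^T g / (g^T A g)$ and checks that it minimizes $f$ along the direction $-g$:

```python
import numpy as np

# Made-up SPD example for the closed-form optimal step size
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

def f(x):
    return 0.5 * x @ A @ x + b @ x

x = np.array([2.0, -1.0])
g = A @ x + b                    # gradient at x
alpha = (g @ g) / (g @ A @ g)    # closed-form optimal step

# phi(a) = f(x - a g) should be smallest at a = alpha
phi = lambda a: f(x - a * g)
assert phi(alpha) <= phi(alpha - 1e-3)
assert phi(alpha) <= phi(alpha + 1e-3)
```

Since $\phi(a) = f(x - a g)$ is a one-dimensional quadratic with positive leading coefficient $\tfrac{1}{2} g^T A g$ (positive because $A$ is SPD), its stationary point is indeed a minimum.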