Gradient Descent: Minimising the Directional Derivative in Direction $\mathbf{u}$


The following excerpt is from chapter 4.3 of Deep Learning, by Goodfellow, Bengio, and Courville:

[Excerpt from the book: the direction of steepest descent is found by minimising the directional derivative, $\min\limits_{\mathbf{u},\, \mathbf{u}^T \mathbf{u} = 1} \mathbf{u}^T \nabla_{\mathbf{x}} f(\mathbf{x})$, over unit vectors $\mathbf{u}$.]

I don't understand the following:

  1. What is meant by $\mathbf{u}, \mathbf{u}^T \mathbf{u} = 1$ in $\min\limits_{\mathbf{u},\, \mathbf{u}^T \mathbf{u} = 1}$?

  2. Why is $\min\limits_{\mathbf{u},\, \mathbf{u}^T \mathbf{u} = 1} \mathbf{u}^T \nabla_{\mathbf{x}} f(\mathbf{x}) = \min\limits_{\mathbf{u},\, \mathbf{u}^T \mathbf{u} = 1} ||\mathbf{u}||_2 || \nabla_{\mathbf{x}} f(\mathbf{x})||_2 \cos(\theta)$? I have no idea how the components of the latter expression came about.

  3. The authors state that the factors that do not depend on $\mathbf{u}$ can be ignored, and that the expression then simplifies to $\min_{\mathbf{u}} \cos(\theta)$. But $\cos(\theta)$ depends on $\theta$ -- not on $\mathbf{u}$?

  4. I'm not sure that I understand what is meant by the explanation immediately following the above, but this could be due to my not understanding the preceding information.

I would greatly appreciate it if people could please take the time to clarify this.

There are 2 answers below.

Best answer:
  1. $\mathbf{u}$ is a unit vector in the direction along which you want to evaluate the slope. That is why the constraint $\mathbf{u}^T \mathbf{u} = 1$ appears in the minimisation.
  2. The expression in your second question is simply the dot product between the vector $\mathbf{u}$ and the gradient vector $\nabla_{\mathbf{x}} f(\mathbf{x})$, which always equals the product of the two lengths times the cosine of the included angle $\theta$.
  3. One can ignore the two magnitudes because they are fixed values independent of direction, and it is the relative directions of the two vectors that define $\theta$ -- so $\cos(\theta)$ does depend on $\mathbf{u}$, through the angle $\mathbf{u}$ makes with the gradient.
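The dot-product identity and its minimiser are easy to check numerically. A minimal sketch, using a hypothetical gradient vector chosen purely for illustration:

```python
import numpy as np

# Hypothetical gradient at some point x (values chosen for illustration).
grad = np.array([3.0, -4.0])          # ||grad||_2 = 5

# Sweep unit vectors u(t) = (cos t, sin t) over all directions in 2D.
angles = np.linspace(0.0, 2 * np.pi, 100000)
us = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Directional derivative u . grad for each direction; by the dot-product
# identity this equals ||u||_2 ||grad||_2 cos(theta) = ||grad||_2 cos(theta).
dir_derivs = us @ grad

# The minimum is -||grad||_2 (where cos(theta) = -1), attained where u
# points exactly opposite the gradient.
u_star = us[np.argmin(dir_derivs)]
print(dir_derivs.min())                      # approx -5.0
print(u_star, -grad / np.linalg.norm(grad))  # nearly identical vectors
```

The brute-force minimiser coincides with the negative normalised gradient, which is exactly the "direction of steepest descent" the excerpt derives.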
Another answer:

Note that steepest descent and gradient descent are related, but there are subtleties that I think get obfuscated by calling them "synonymous" as the authors imply.

As an alternative to searching along all directional derivatives, we can derive gradient descent (GD) from the principle of "steepest descent" using a particular ("prox") model approximation.

We wish to minimize the cost function $J(w)$. At the $t^{th}$ iteration, we seek the next iterate $w^{(t+1)}$ which decreases $J(w)$ "the most". To quantify "most", we can use Taylor's theorem to guide our choice of the next iterate as the minimizer
$$ w^{(t+1)} \gets \arg\min_w J(w^{(t)}) + \left\langle \nabla J(w^{(t)})\, , \,w - w^{(t)} \right\rangle + \frac{\eta}{2} \|w - w^{(t)}\|^2 $$
for some choice of norm.

Rewriting $w = w^{(t)} + s \Delta w$ and substituting above gives
$$ w^{(t+1)} \gets \arg\min_{\|\Delta w\|^2=1} J(w^{(t)}) + s \left\langle \nabla J(w^{(t)})\, , \,\Delta w \right\rangle + \frac{\eta}{2} s^2 \|\Delta w\|^2. $$

Setting the derivative of this quadratic with respect to $s$ to zero gives the optimal step length
$$ s^* = -\frac{\left\langle \nabla J(w^{(t)})\, , \,\Delta w \right\rangle}{\eta}, $$
which then means we can find the best direction from
$$ \arg\min_{\|v\|^2=1} -\frac{\left\langle \nabla J(w^{(t)})\, , \,v \right\rangle}{\eta} = \arg\max_{\|v\|^2=1} \left\langle \nabla J(w^{(t)})\, , \,v \right\rangle. $$

Note that the choice of the Euclidean norm $\|v\|_2$ gives the gradient as the steepest descent direction, and that the notion of alignment (or $\cos\theta$) comes in through the inner product. Other choices of norm lead to other "steepest descent" directions (e.g., consider the $\ell_1$ and $\ell_{\infty}$ norms).
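With the Euclidean norm, combining $s^*$ with the optimal direction recovers the familiar gradient step $w^{(t+1)} = w^{(t)} - \frac{1}{\eta}\nabla J(w^{(t)})$. A minimal sketch of that update, assuming a simple hypothetical quadratic cost (not from the book):

```python
import numpy as np

# Hypothetical convex quadratic J(w) = 0.5 ||w - target||^2,
# whose minimiser is `target` and whose gradient is (w - target).
target = np.array([1.0, -2.0])

def grad_J(w):
    return w - target

eta = 2.0            # model curvature; the effective step size is 1/eta
w = np.zeros(2)
for _ in range(100):
    # Each step minimises the local quadratic model at the current w.
    w = w - grad_J(w) / eta

print(w)  # converges to [1, -2]
```

Each iteration here is exactly the minimiser of the quadratic ("prox") model above, illustrating how gradient descent falls out of the steepest-descent derivation.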