The following excerpt is from chapter 4.3 of Deep Learning, by Goodfellow, Bengio, and Courville:
I don't understand the following:
What is meant by $\mathbf{u}, \mathbf{u}^T \mathbf{u} = 1$ in $\min\limits_{\mathbf{u}, \mathbf{u}^T \mathbf{u} = 1}$?
Why is $\min\limits_{\mathbf{u}, \mathbf{u}^T \mathbf{u} = 1} \mathbf{u}^T \nabla_{\mathbf{x}} f(\mathbf{x}) = \min\limits_{\mathbf{u}, \mathbf{u}^T \mathbf{u} = 1} ||\mathbf{u}||_2 ||\nabla_{\mathbf{x}} f(\mathbf{x})||_2 \cos(\theta)$? I have no idea how the components of the latter expression came about.
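For what it's worth, I did try a quick numeric sanity check of that identity (this is just my own sketch in NumPy; `g` stands in for the gradient $\nabla_{\mathbf{x}} f(\mathbf{x})$, and $\theta$ is the angle between $\mathbf{u}$ and `g`), and it does seem to hold:

```python
import numpy as np

# Check u^T g = ||u||_2 ||g||_2 cos(theta) in 2D, where theta is the
# angle between u and g. g stands in for the gradient of f at x.
g = np.array([3.0, 4.0])                       # ||g||_2 = 5
g_hat = g / np.linalg.norm(g)                  # unit vector along g

for theta in [0.0, np.pi / 4, np.pi / 2, np.pi]:
    # Rotate g's direction by theta to get a unit vector u at angle
    # theta to g; u automatically satisfies the constraint u^T u = 1.
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    u = rot @ g_hat

    lhs = u @ g                                # u^T g
    rhs = np.linalg.norm(u) * np.linalg.norm(g) * np.cos(theta)
    assert np.isclose(lhs, rhs)

print("identity holds at every angle tested")
```

The check also suggests the minimum over unit vectors is attained at $\theta = \pi$, i.e. when $\mathbf{u}$ points opposite the gradient, but I still don't follow the book's derivation.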
The authors state that the factors that do not depend on $\mathbf{u}$ are ignored, and that the expression then simplifies to $\min_{\mathbf{u}} \cos(\theta)$. But $\cos(\theta)$ depends on $\theta$, not on $\mathbf{u}$ -- doesn't it?
I'm also not sure I understand the explanation that immediately follows, but that may just be because I don't understand the preceding material.
I would greatly appreciate it if people could please take the time to clarify this.
