I’m following a course on Machine Learning at uni at the moment. It is the first mathematics course I’ve followed in a while, so I’ve had some difficulty getting back into the mathematics ‘way of thinking’ if you know what I mean. The exercise I’m having difficulty with at the moment is as follows:
Let $f : \mathbb{R}^d \to \mathbb{R}$ be differentiable. Show that the largest possible rate of increase of the functions
$$ \phi_v(t) = f(x+tv), \quad v \in \mathbb{R}^d, \ \|v\|_2 = 1$$
at $t=0$ is $\|\nabla f(x)\|_2$, and that it is attained for $v = \nabla f(x) / \|\nabla f(x)\|_2$.
A hint was given to use the Cauchy-Schwarz inequality (recalling when it holds with equality), but I have no idea how I would even start solving this problem. Any help at all would be greatly appreciated! Thanks in advance!
By the chain rule
\begin{align} \frac{\Bbb d}{\Bbb dt}\phi_v(t)\vert_{t=0} &= \nabla f(x)\cdot \frac{\Bbb d}{\Bbb dt}(x+tv)\vert_{t=0} \\ &= \nabla f(x)\cdot v \end{align}
So in magnitude we find
\begin{align} \left\vert \frac{\Bbb d}{\Bbb dt}\phi_v(t)\vert_{t=0}\right\vert &=\left\vert\nabla f(x)\cdot v\right\vert\\ &\leq \|\nabla f(x)\|_2\|v\|_2\\&=\|\nabla f(x)\|_2\end{align}
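where we used the Cauchy-Schwarz inequality. Assuming $\nabla f(x) \neq 0$, equality holds exactly when the vectors are parallel, so we must have $v = c\nabla f(x)$ for some scalar $c$. Since $\|v\|_2=1$, this gives $c=\pm \frac{1}{\|\nabla f(x)\|_2}$. For the $+$ sign we get $\nabla f(x)\cdot v = \|\nabla f(x)\|_2^2 / \|\nabla f(x)\|_2 = \|\nabla f(x)\|_2$, which is the greatest increase; the $-$ sign gives $-\|\nabla f(x)\|_2$, the greatest decrease.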
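If it helps to see the result numerically, here is a minimal sketch that checks it on one example. The function `f`, its hand-computed gradient `grad_f`, the point `x`, and the helper names are all made up for illustration; they are not part of the exercise. It approximates $\phi_v'(0)$ by a central difference, compares the value along $v = \nabla f(x)/\|\nabla f(x)\|_2$ with $\|\nabla f(x)\|_2$, and checks that many random unit directions never do better.

```python
import math
import random


def f(x):
    # Hypothetical test function on R^3, chosen only for illustration.
    return x[0] ** 2 + 3.0 * x[0] * x[1] + math.sin(x[2])


def grad_f(x):
    # Gradient of the f above, computed by hand.
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0], math.cos(x[2])]


def norm(v):
    # Euclidean (2-)norm.
    return math.sqrt(sum(c * c for c in v))


def directional_derivative(x, v, h=1e-6):
    # Central-difference approximation of phi_v'(0), where phi_v(t) = f(x + t v).
    forward = f([xi + h * vi for xi, vi in zip(x, v)])
    backward = f([xi - h * vi for xi, vi in zip(x, v)])
    return (forward - backward) / (2.0 * h)


def random_unit_vector(d, rng):
    # Normalised Gaussian sample, i.e. a random direction with ||v||_2 = 1.
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    n = norm(v)
    return [c / n for c in v]


rng = random.Random(0)
x = [1.0, -2.0, 0.5]
g = grad_f(x)
g_norm = norm(g)

# Rate of increase along the normalised gradient direction.
v_star = [c / g_norm for c in g]
along_gradient = directional_derivative(x, v_star)

# Best rate of increase found among many random unit directions.
best_random = max(
    directional_derivative(x, random_unit_vector(3, rng)) for _ in range(1000)
)

print("||grad f(x)||       :", g_norm)
print("rate along gradient :", along_gradient)
print("best random rate    :", best_random)
```

The first two printed numbers should agree up to finite-difference error, and the third should not exceed them. This is only a sanity check on one function, not a proof; the Cauchy-Schwarz argument is what establishes the claim in general.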