We are always told that "A Function Reduces by the Greatest Quantity in the Direction of its Derivative" - but how exactly do we know this?
I tried looking into this and came across something called the "Directional Derivative" (https://en.wikipedia.org/wiki/Directional_derivative): this concept seems to say that if you have a function and choose some point on it, there are many possible directions you could move in from that point.
- However, I am still not certain why moving in the "direction of the derivative" is said to decrease the value of the function by the greatest amount within a local neighbourhood.
Can someone please explain why this is the case?
Thank you!
$\newcommand{\d}{\mathrm{d}}$I'll answer your question directly after first taking a necessary detour through the general definition of the derivative for real functions.
The multidimensional derivative of a map $\Bbb R^n\to\Bbb R$ is often represented by a gradient, as you know, but this is a notational convenience for what the derivative really is, in any dimension: a linear map. If you don't know what that is, a linear map is a function $T$ such that $T(ax+by)=aT(x)+bT(y)$; importantly, matrices (through matrix multiplication) and vectors (through dot products) can represent them, and you hopefully know that the dot product has this property. We care about such functions since they are very "nice" in wider mathematics. Note also that $T(0)=0$ for any such map, which is relevant for differentiation.
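To make the linearity property concrete, here is a minimal numerical sketch (my own illustration; the vector $g$ and the inputs are arbitrary choices): the map $T(x)=g\cdot x$ given by dotting with a fixed vector $g$ satisfies $T(ax+by)=aT(x)+bT(y)$ and $T(0)=0$.

```python
# Check that T(x) = g . x is a linear map:
# T(a*x + b*y) == a*T(x) + b*T(y), and T(0) == 0.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

g = [2.0, -1.0, 3.0]          # fixed vector representing the map T
T = lambda x: dot(g, x)

x, y = [1.0, 0.5, -2.0], [0.0, 4.0, 1.0]
a, b = 3.0, -0.5

lhs = T([a * xi + b * yi for xi, yi in zip(x, y)])   # T(a*x + b*y)
rhs = a * T(x) + b * T(y)

assert abs(lhs - rhs) < 1e-12
assert T([0.0, 0.0, 0.0]) == 0.0
```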
More specifically, if $f:\Bbb R^n\to\Bbb R^m$, we say it is differentiable at a point $x_0\in\Bbb R^n$ iff. there exists a linear map $\d f_{x_0}:\Bbb R^n\to\Bbb R^m$ such that $\psi(x)$ is both continuous in a neighbourhood of $x_0$ and $\psi(x)\in o(\|x-x_0\|)$, where $\psi$ is given by:
$$\psi(x):=f(x)-f(x_0)-\d f_{x_0}(x-x_0)$$
This means we can say:
$$f(x)=f(x_0)+\d f_{x_0}(x-x_0)+\psi(x)\approx f(x_0)+\d f_{x_0}(x-x_0)$$
when $x$ is close to $x_0$: this is the linear approximation. If such a linear map $\d f_{x_0}$ exists at every point $x_0$, $f$ is said to be differentiable (and such maps are unique, when they exist at all). In one dimension, linear maps are just multiplications by a scalar, so $\d f_{x_0}(x-x_0)=c(x-x_0)$, and we denote $c=f'(x_0)$.
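As a quick numerical sanity check of this definition (my own sketch; the function $f(x,y)=x^2y$, the base point, and the direction are arbitrary choices), the remainder $\psi$ really is in $o(\|x-x_0\|)$: the ratio $|\psi(x)|/\|x-x_0\|$ shrinks as $x\to x_0$.

```python
# For f(x, y) = x^2 * y, the derivative at x0 is h |-> grad(x0) . h.
# The remainder psi(x) = f(x) - f(x0) - grad(x0) . (x - x0) should
# satisfy |psi| / ||x - x0|| -> 0 as the step size t shrinks.

def f(x, y):
    return x * x * y

def grad(x, y):               # gradient of f, computed by hand
    return (2 * x * y, x * x)

x0, y0 = 1.0, 2.0
gx, gy = grad(x0, y0)
h = (0.6, -0.8)               # a fixed unit direction, ||h|| = 1

ratios = []
for t in (1e-1, 1e-2, 1e-3):
    dx, dy = t * h[0], t * h[1]
    psi = f(x0 + dx, y0 + dy) - f(x0, y0) - (gx * dx + gy * dy)
    ratios.append(abs(psi) / t)   # |psi| / ||x - x0||

assert ratios[0] > ratios[1] > ratios[2]   # decays toward 0
```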
So, the derivative $\d f_{x_0}$ is such a map, but since we go from $\Bbb R^n$ to $\Bbb R$, it can be represented by multiplication with a one-row matrix, or equivalently by taking the dot product with a vector: the gradient, $\nabla_{x_0}$. So, whenever $f:\Bbb R^n\to\Bbb R$ is differentiable at a point $x_0$, we get to say that:
$$f(x)-f(x_0)=\nabla_{x_0}\cdot(x-x_0)+\psi(x)\approx\nabla_{x_0}\cdot(x-x_0),\,\text{as $x\to x_0$}$$
Where $\psi$ is as before. If I want to find the direction of displacement from $x_0$ that maximises the change in $f$, I first need to normalise by the length of the displacement, so that directions are compared fairly regardless of step size, and consider this:
$$\frac{f(x)-f(x_0)}{\|x-x_0\|}=\nabla_{x_0}\cdot\frac{x-x_0}{\|x-x_0\|}+\frac{\psi(x)}{\|x-x_0\|}\approx\nabla_{x_0}\cdot\frac{x-x_0}{\|x-x_0\|}$$
The approximation is ok $^\ast$ when $x$ is close to $x_0$, as $\psi(x)\in o(\|x-x_0\|)$. From the above expression, it is now clear that in order to maximise the LHS to find the "steepest ascent/descent", I must maximise (or minimise) the expression $\nabla_{x_0}\cdot h$, where $h$ is a unit vector of displacement from $x_0$. We know that, where $\vartheta$ is the angle between $\nabla_{x_0}$ and $h$: $$\nabla_{x_0}\cdot h=\|h\|\cdot\|\nabla_{x_0}\|\cdot\cos(\vartheta)=\|\nabla_{x_0}\|\cdot\cos(\vartheta)$$And this is maximised (or minimised) precisely when cosine attains its extreme values, so the optimal directions of steepest ascent/descent are parallel to the "direction of the gradient": $\vartheta=0^\circ$ gives steepest ascent and $\vartheta=180^\circ$ gives steepest descent.
I put "direction of the gradient" in quotation marks because formally the gradient, or the derivative/Jacobian matrix, is a linear map. It is an intuitive convenience to think of it as an object with direction, but this will become confusing if you do more multivariable calculus.
$^\ast$ Notice that this is still approximate. The "direction of the gradient" will not necessarily be that of steepest ascent if you move far enough away from $x_0$ - this is only meaningful for $x$ very close to $x_0$.
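The cosine argument above can also be checked numerically. Here is a small sketch (my own illustration; the function and the point are arbitrary choices): scanning over unit vectors $h=(\cos\alpha,\sin\alpha)$, the dot product $\nabla_{x_0}\cdot h$ is largest when $h$ points along the gradient, and the maximum value is $\|\nabla_{x_0}\|$.

```python
import math

def f(x, y):
    return x * x + 3.0 * x * y      # any smooth function works here

def grad(x, y):                     # hand-computed gradient of f
    return (2 * x + 3 * y, 3 * x)

x0, y0 = 1.0, -0.5
gx, gy = grad(x0, y0)
gnorm = math.hypot(gx, gy)

# Scan unit directions h = (cos a, sin a) and record where grad . h peaks.
best_angle, best_value = None, -float("inf")
for k in range(3600):
    a = 2 * math.pi * k / 3600
    value = gx * math.cos(a) + gy * math.sin(a)   # grad . h
    if value > best_value:
        best_angle, best_value = a, value

# The maximum is ||grad|| and is attained in the gradient's own direction.
grad_angle = math.atan2(gy, gx) % (2 * math.pi)
assert abs(best_value - gnorm) < 1e-5
assert abs(best_angle - grad_angle) <= 2 * math.pi / 3600
```

The steepest-descent direction falls out the same way: scanning for the minimum instead finds the angle opposite the gradient, with value $-\|\nabla_{x_0}\|$.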