"A Function Reduces by the Greatest Quantity in the Direction of its Derivative" - How do we know this?


We are always told that "a function reduces by the greatest quantity in the direction of its derivative" - but how exactly do we know this?

I tried looking into this and came across something called the "Directional Derivative" (https://en.wikipedia.org/wiki/Directional_derivative): This concept seems to explain that if you have a function and you choose some point on this function - there are many possible directions you could move in relative to that point.

  • However, I am still not certain as to why moving in the "direction of the derivative" is said to reduce the magnitude of the function by the greatest value in a local neighborhood.

Can someone please explain why this is the case?

Thank you!

2 Answers


$\newcommand{\d}{\mathrm{d}}$I'll get to the point and explain your direct question after first taking a necessary detour through the general definition of derivative for real functions.

The multidimensional derivative for a map $\Bbb R^n\to\Bbb R$ is often represented by a gradient, as you know, but this is a notational convenience for what the derivative really is, in any dimension: a linear map. If you haven't met these before, the important point is that matrices (through matrix multiplication) and vectors (through dot products) can represent them. A linear map is a function $T$ satisfying $T(ax+by)=aT(x)+bT(y)$, which you hopefully recognise as a property of the dot product. We care about such functions because they are very "nice" in wider mathematics. Note also that $T(0)=0$ for any such map, which is relevant for differentiation.
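As a quick numeric sanity check (my own example, not part of the answer's argument), take $T$ to be the dot product with a fixed vector $w$ and verify the two properties above:

```python
# Sketch: the map T(x) = w . x is linear, i.e. T(a*x + b*y) = a*T(x) + b*T(y),
# and it sends the zero vector to zero. The vector w here is arbitrary.
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(3)               # fixed vector representing T
T = lambda x: float(w @ x)               # T is "dot product with w"

x, y = rng.standard_normal(3), rng.standard_normal(3)
a, b = 2.5, -1.3
assert abs(T(a*x + b*y) - (a*T(x) + b*T(y))) < 1e-12   # linearity
assert T(np.zeros(3)) == 0.0                           # T(0) = 0
```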

More specifically, if $f:\Bbb R^n\to\Bbb R^m$, we say it is differentiable at a point $x_0\in\Bbb R^n$ iff. there exists a linear map $\d f_{x_0}:\Bbb R^n\to\Bbb R^m$ such that $\psi(x)$ is both continuous in a neighbourhood of $x_0$ and $\psi(x)\in o(\|x-x_0\|)$, where $\psi$ is given by:

$$\psi(x):=f(x)-f(x_0)-\d f_{x_0}(x-x_0)$$

This means we can say:

$$f(x)=f(x_0)+\d f_{x_0}(x-x_0)+\psi(x)\approx f(x_0)+\d f_{x_0}(x-x_0)$$

when $x$ is close to $x_0$: this is the linear approximation. If such a linear map $\d f_{x_0}$ exists at every point $x_0$, then $f$ is said to be differentiable (and such maps are unique when they exist at all). In one dimension, linear maps are just multiplication by a scalar, so $\d f_{x_0}(x-x_0)=c(x-x_0)$, and we denote $c=f'(x_0)$.
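To see the remainder $\psi$ behaving as $o(\|x-x_0\|)$ concretely, here is a small sketch with an assumed function $f(x,y)=x^2+3y$ (my own choice, not from the answer); the ratio $|\psi(x)|/\|x-x_0\|$ shrinks as $x\to x_0$:

```python
# Sketch, assuming f(x, y) = x**2 + 3*y with gradient (2x, 3):
# psi(x) = f(x) - f(x0) - df_{x0}(x - x0) should vanish faster than ||x - x0||.
import numpy as np

f = lambda p: p[0]**2 + 3*p[1]
grad = lambda p: np.array([2*p[0], 3.0])      # derivative, acting via dot product

x0 = np.array([1.0, 2.0])
d = np.array([0.6, -0.8])                     # a unit direction
ratios = []
for t in [1e-1, 1e-2, 1e-3]:
    x = x0 + t*d
    psi = f(x) - f(x0) - grad(x0) @ (x - x0)  # the remainder term
    ratios.append(abs(psi) / t)               # |psi| / ||x - x0||
assert ratios[0] > ratios[1] > ratios[2]      # the ratio shrinks as t -> 0
```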

So, the derivative of $f:\Bbb R^n\to\Bbb R$ is such a map, and because the codomain is one-dimensional it can be represented by multiplication with a matrix with one row or, equivalently, by taking the dot product with a vector: the gradient, $\nabla_{x_0}$. So, whenever $f:\Bbb R^n\to\Bbb R$ is differentiable at a point $x_0$, we get to say that:

$$f(x)-f(x_0)=\nabla_{x_0}\cdot(x-x_0)+\psi(x)\approx\nabla_{x_0}\cdot(x-x_0),\,\text{as $x\to x_0$}$$

Where $\psi$ is as before. If I want to find the direction of displacement from $x_0$ that maximises change in $f$, first of all I need to normalise everything to avoid bias, and consider this:

$$\frac{f(x)-f(x_0)}{\|x-x_0\|}=\nabla_{x_0}\cdot\frac{x-x_0}{\|x-x_0\|}+\frac{\psi(x)}{\|x-x_0\|}\approx\nabla_{x_0}\cdot\frac{x-x_0}{\|x-x_0\|}$$

The approximation is ok $^\ast$ when $x$ is close to $x_0$, as $\psi(x)\in o(\|x-x_0\|)$. From the above expression, it is now clear that in order to maximise the LHS to find the "steepest ascent/descent", I must maximise (or minimise) the expression $\nabla_{x_0}\cdot h$, where $h$ is a unit vector of displacement from $x_0$. We know that, where $\vartheta$ is the angle between $\nabla_{x_0}$ and $h$: $$\nabla_{x_0}\cdot h=\|h\|\cdot\|\nabla_{x_0}\|\cdot\cos(\vartheta)=\|\nabla_{x_0}\|\cdot\cos(\vartheta)$$And this is maximised (or minimised) precisely when cosine attains its extreme values, at $0^\circ$ or $180^\circ$, so the optimal directions of steepest ascent/descent are parallel to the "direction of the gradient" ($0^\circ$ for ascent, $180^\circ$ for descent).
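The maximisation step can be checked numerically. In this sketch (my own, with an assumed gradient vector $g=(3,4)$), we scan many unit directions $h$ and confirm that $\nabla_{x_0}\cdot h$ peaks when $h$ points along the gradient:

```python
# Sketch: among unit directions h, the dot product g . h is bounded by ||g||
# and is largest when h is the normalised gradient. g is an assumed example.
import numpy as np

g = np.array([3.0, 4.0])                                   # assumed gradient, ||g|| = 5
angles = np.linspace(0, 2*np.pi, 3600, endpoint=False)
hs = np.stack([np.cos(angles), np.sin(angles)], axis=1)    # unit vectors h
vals = hs @ g                                              # g . h for each h
best = hs[np.argmax(vals)]
assert np.max(vals) <= np.linalg.norm(g) + 1e-9            # bounded by ||g|| cos(0)
assert np.allclose(best, g / np.linalg.norm(g), atol=1e-2) # maximiser points along g
```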

I put "direction of the gradient" in quotation marks because formally the gradient, or the derivative/Jacobian matrix, is a linear map. It is an intuitive convenience to think of it as an object with a direction, but this can become confusing if you do more multivariable calculus.

$^\ast$ Notice that this is still approximate. The "direction of the gradient" will not necessarily be that of steepest ascent if you move far enough away from $x_0$ - this is only meaningful for $x$ very close to $x_0$.


Consider a differentiable function $f:\mathbb{R}^n\to\mathbb{R}$ with gradient given by

$$\nabla f(x)=(\partial_1f(x),\dots,\partial_nf(x)).$$

This gradient is the multivariable analogue of the derivative (and is a special case of the total derivative for when the function still just maps to $\mathbb{R}$). What this means is that it encodes how much the function $f$ changes in each of the basis directions: for example, if $n=2$, it records how much $f$ changes when you vary the first variable, i.e. in the $x$-direction, and how much it changes when you vary the second variable, i.e. in the $y$-direction. A good way to visualize it is to simply sketch what it could look like.
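A finite-difference sketch makes this concrete (my own example function, $f(x,y)=x^2+\sin y$, not from the answer): perturbing one coordinate at a time recovers exactly the partial derivatives that make up the gradient:

```python
# Sketch: central finite differences along each basis direction recover
# the components of the gradient, here for f(x, y) = x**2 + sin(y).
import numpy as np

f = lambda p: p[0]**2 + np.sin(p[1])
x = np.array([1.0, 0.5])
eps = 1e-6
# One difference quotient per basis vector e (the rows of the identity matrix)
num_grad = np.array([(f(x + eps*e) - f(x - eps*e)) / (2*eps)
                     for e in np.eye(2)])
exact = np.array([2*x[0], np.cos(x[1])])      # (d/dx, d/dy) of f at x
assert np.allclose(num_grad, exact, atol=1e-6)
```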

Now with this in mind, let us turn to your question of why this means that the function changes the most in the direction of $\nabla f$. Firstly, we can intuitively expect it: the gradient combines the rates of change in each coordinate direction, and moving in any other direction only picks up a scaled-down portion of each of those changes.

Now let us turn to actually proving it. As you have stumbled upon yourself, there is a notion of a directional derivative, which simply tells you (with a slight thing to keep in mind) how much the function changes in any given direction. We can define the directional derivative in the direction of $v\in \mathbb{R}^n$ as

$$\nabla_v f(x)=\lim_{h\to 0}\frac{f(x+hv)-f(x)}{h}.$$

It can then be quite easily shown that this means that

$$\nabla_vf(x)=\langle \nabla f(x),v\rangle,$$

where $\langle\cdot,\cdot\rangle$ denotes the standard inner product (the dot product). Now I mentioned there is a slight thing to keep in mind, and this is that this quantity scales depending on the magnitude of $v$, so to actually get the change in a given direction, we should only consider unit vectors $v$, i.e. such that $\lVert v \rVert=1$, as this encodes simply a direction. Then we turn back to some linear algebra and remember also that

$$\langle \nabla f(x),v\rangle=\lVert \nabla f(x)\rVert \lVert v \rVert \cos(\theta),$$

where $\theta$ is the angle between $\nabla f(x)$ and $v$. Since we wanted $\lVert v\rVert=1$, the only thing that really matters here is the angle $\theta$, as $\nabla f(x)$ will have a fixed value. Now when is $\cos(\theta)$ the greatest? You guessed it, when $\theta=0$. But if the angle is zero, this means that $v$ points in the same direction as $\nabla f(x)$, and thus what we expected follows: the direction of greatest change is the direction of the gradient!
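Both claims above can be verified numerically. In this sketch (my own example, $f(x,y)=xy+x^2$, not from the answer), the limit quotient matches $\langle\nabla f(x),v\rangle$ for a unit vector $v$, and scanning unit directions confirms the maximiser is the normalised gradient:

```python
# Sketch: (1) the difference quotient approximates <grad f, v>;
# (2) over unit directions, <grad f, v> peaks at v = grad f / ||grad f||.
# f and the evaluation point are assumed examples.
import numpy as np

f = lambda p: p[0]*p[1] + p[0]**2
grad_f = lambda p: np.array([p[1] + 2*p[0], p[0]])

x = np.array([1.0, -2.0])
v = np.array([0.6, 0.8])                          # a unit direction, ||v|| = 1
h = 1e-6
quotient = (f(x + h*v) - f(x)) / h                # the directional-derivative limit
assert abs(quotient - grad_f(x) @ v) < 1e-4       # matches <grad f(x), v>

g = grad_f(x)
thetas = np.linspace(0, 2*np.pi, 1000, endpoint=False)
dirs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
best = dirs[np.argmax(dirs @ g)]                  # direction of greatest change
assert np.allclose(best, g / np.linalg.norm(g), atol=1e-2)
```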