Demonstration Gradient Descent


The directional derivative represents the slope of a function along a given direction. The direction that minimizes the directional derivative is the anti-gradient, and the proof of this is clear to me. But why is minimizing the directional derivative at a point $x$ equivalent to finding the direction in which the function decreases fastest at $x$? I can't understand why the anti-gradient gives the direction in which the function decreases fastest. What is the relationship between minimizing the directional derivative and the monotonicity of the function being minimized?

Best Answer

The answer to your question is rooted in the relation between derivatives and linear approximations by tangents. Given a differentiable real-valued function $f: \mathbb{R}^n \rightarrow \mathbb{R}$, the linear approximation of $f$ at a point $a = (a_1, a_2, \dots, a_n) \in \mathbb{R}^n$ is given by:

$$g(x) = f(a) + \sum\limits_{i=1}^{n} f_{x_i}(a)(x_i-a_i)$$ where $x = (x_1, x_2, \dots, x_n) \in \mathbb{R}^n$.

It is easy to verify that $g(a) = f(a)$ and that $\nabla g(x) = \nabla f(a)$ for every $x$, since $g$ is affine with constant gradient.
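These two properties can be checked numerically. The sketch below uses a hypothetical example function $f(x, y) = x^2 + 3y^2$ at $a = (1, 2)$ (not from the original post), builds the linear approximation $g$, and verifies that $g(a) = f(a)$ and that the gradient of $g$ equals $\nabla f(a)$.

```python
import math

def f(x, y):
    # hypothetical example function
    return x**2 + 3 * y**2

def grad_f(x, y):
    # analytic gradient of f
    return (2 * x, 6 * y)

a = (1.0, 2.0)
fa = f(*a)
ga = grad_f(*a)

def g(x, y):
    # linear approximation of f at a
    return fa + ga[0] * (x - a[0]) + ga[1] * (y - a[1])

print(g(*a) == fa)  # g agrees with f at a

# central differences: gradient of g is constant and equals grad f(a)
h = 1e-6
gx = (g(a[0] + h, a[1]) - g(a[0] - h, a[1])) / (2 * h)
gy = (g(a[0], a[1] + h) - g(a[0], a[1] - h)) / (2 * h)
print(abs(gx - ga[0]) < 1e-5, abs(gy - ga[1]) < 1e-5)
```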

Now take any unit vector $\hat{u}$. If we want to minimize $D_{\hat{u}}f(a)$, we have to minimize $\nabla f(a) \cdot \hat{u} = \nabla g(a) \cdot \hat{u}$, so we can study $g$ instead of $f$.
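The identity $D_{\hat{u}}f(a) = \nabla f(a) \cdot \hat{u}$ can itself be checked with a finite difference. A minimal sketch, again assuming the hypothetical $f(x, y) = x^2 + 3y^2$ at $a = (1, 2)$ with a sample unit vector $\hat{u}$:

```python
import math

def f(x, y):
    # hypothetical example function
    return x**2 + 3 * y**2

a = (1.0, 2.0)
grad_a = (2 * a[0], 6 * a[1])                # grad f(a) = (2, 12)
u = (1 / math.sqrt(2), 1 / math.sqrt(2))     # a sample unit vector

# forward-difference estimate of the directional derivative D_u f(a)
h = 1e-6
num = (f(a[0] + h * u[0], a[1] + h * u[1]) - f(*a)) / h
dot = grad_a[0] * u[0] + grad_a[1] * u[1]
print(abs(num - dot) < 1e-4)  # the two agree up to finite-difference error
```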

For $n = 1$ and $n = 2$, where the graph of $g$ is a line and a plane respectively, it is easy to see that the direction minimizing the directional derivative is opposite to the gradient.

In higher dimensions the relationship is harder to visualize. Let's look for a unit vector $\hat{u} = (u_1, u_2, \dots, u_n)$ such that $g(a + \hat{u})$ is minimized.

$$g(a + \hat{u}) = f(a) + \sum\limits_{i=1}^{n} f_{x_i}(a)((a_i + u_i)-a_i) = f(a) + \sum\limits_{i=1}^{n} f_{x_i}(a) u_i = f(a) + \nabla g(a) \cdot \hat{u}$$

Since $f(a)$ is a constant, we need to minimize the term $\nabla g(a) \cdot \hat{u}$. By the Cauchy–Schwarz inequality, $\nabla g(a) \cdot \hat{u} \ge -\|\nabla g(a)\|$, with equality exactly when $\hat{u} = - \frac{\nabla g(a)}{\| \nabla g(a)\|}$. In other words, the minimum occurs when $\hat{u}$ points in the opposite direction of the gradient.
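This minimization can be verified by brute force in two dimensions: sweep unit vectors $\hat{u} = (\cos t, \sin t)$ over a fine grid of angles and pick the one with the smallest dot product against a hypothetical gradient vector (here $(3, 4)$, chosen for illustration).

```python
import math

grad = (3.0, 4.0)  # a hypothetical gradient vector, ||grad|| = 5

# sweep 3600 unit vectors and keep the angle minimizing grad . u
best_angle = min(
    (t * 2 * math.pi / 3600 for t in range(3600)),
    key=lambda t: grad[0] * math.cos(t) + grad[1] * math.sin(t),
)
u = (math.cos(best_angle), math.sin(best_angle))

# compare with -grad / ||grad||
anti = (-grad[0] / 5, -grad[1] / 5)
print(abs(u[0] - anti[0]) < 1e-2, abs(u[1] - anti[1]) < 1e-2)
```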

However, this is not true for minimizing $f(a + \hat{u})$. At first glance, one might think that to minimize $f(a + \hat{u})$, the unit vector $\hat{u}$ must point opposite the gradient. But $\hat{u}$ has length 1, which is large enough for the function to have many local minima in the unit ball around $a$, so no single direction need be optimal.
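A one-dimensional sketch makes this concrete. For the hypothetical function $f(x) = \sin(5x) + 0.2x$ at $a = 0$ (chosen for illustration, not from the original post), $f'(0) = 5.2 > 0$, so the anti-gradient direction is $-1$; yet because $f$ oscillates within the unit interval around $a$, a full unit step in the *opposite* direction, $+1$, ends at a lower value.

```python
import math

def f(x):
    # hypothetical oscillating function
    return math.sin(5 * x) + 0.2 * x

a = 0.0
fprime = 5 * math.cos(5 * a) + 0.2  # f'(0) = 5.2 > 0, so anti-gradient is -1
print(fprime > 0)

# a unit step in the +1 direction (against the anti-gradient) ends lower
print(f(a + 1) < f(a - 1))
```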

This ambiguity in the meaning of "steepest descent" or "fastest decrease" is what makes gradient descent easy to misunderstand. The direction of steepest descent should only be understood infinitesimally, not at a finite distance from the point. It should also be noted that the opposite direction of the gradient may not be the best direction for finding a global minimum, or even a good local minimum, after several iterations.
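The "infinitesimal" reading is exactly what gradient descent exploits: take many *small* steps along the anti-gradient rather than one unit-length step. A minimal sketch, assuming the hypothetical quadratic $f(x, y) = x^2 + 3y^2$ and a learning rate of $0.1$, showing that $f$ strictly decreases at every iteration:

```python
def f(x, y):
    # hypothetical quadratic objective
    return x**2 + 3 * y**2

x, y = 1.0, 2.0
lr = 0.1  # learning rate: small step, not a unit step
values = [f(x, y)]
for _ in range(50):
    gx, gy = 2 * x, 6 * y              # grad f at the current point
    x, y = x - lr * gx, y - lr * gy    # step opposite the gradient
    values.append(f(x, y))

print(all(b < a for a, b in zip(values, values[1:])))  # f strictly decreases
print(values[-1] < 1e-6)                               # near the minimum at (0, 0)
```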