On Wikipedia, this is the following description of gradient descent:
Gradient descent is based on the observation that if the multivariable function $F(\mathbf{x})$ is defined and differentiable in a neighborhood of a point $\mathbf{a}$, then $F(\mathbf{x})$ decreases fastest if one goes from $\mathbf{a}$ in the direction of the negative gradient of $F$ at $\mathbf{a}$.
Now I have several doubts in this description. First of all I have an example of $f(x)=x^2$ in my mind and my starting point is, say, $x=5$.
- What is the meaning of "decreases fastest"? I mean, I can go straight from $x=5$ to $x=0$ (which is minimum point), then what's the point of fastest decrease? What is the notion of fast here?
- Where did this observation come from? I didn't see the proof of this observation.
In two dimensions, like in the example of $f(x)=x^2$,at the point $(5,25)$ on the curve, there are only two ways to "move" and still be on the curve: to the left and to the right, and in our case, the value of the function decreases as we move to the left, and increases as we go to the right. No problems.
But, when we are on, for example the following function $z=f(x,y)$ which is a two variable function, which looks like the inverted dome below.
Imagine that you are on top of the dome. Then, There are an infinite number of curves(see figure) you can move on and still stay on the surface. And, since our location is so symmetric, it does not matter which direction we move in- it will always be the same rate of change in all directions.
But, suppose we were not on such a symmetric location, say a little lower in any direction. Then, with a little more thought we can see that there is exactly one direction in which you must move in order to achieve maximum descent. That is downward on the curve connecting the point you are on to the top of the dome.
As for rigorous proof of this phenomenon,J.M's link in the comments is an excellent resource.