Why is gradient descent used?


It is clear to me how gradient descent works: we compute the gradient, the vector of first-order partial derivatives, which points in the direction of the fastest growth of the function, and by following it in the reverse direction we approach a minimum (the global minimum if the function is convex).
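The descent loop described above can be sketched in a few lines of Python. The quadratic test function, step size, and iteration count here are illustrative choices, not anything canonical:

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step in the direction opposite the gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Example: f(x, y) = (x - 3)^2 + (y + 1)^2, whose gradient is
# (2(x - 3), 2(y + 1)) and whose unique minimum is at (3, -1).
grad_f = lambda x: 2 * (x - np.array([3.0, -1.0]))
x_min = gradient_descent(grad_f, x0=[0.0, 0.0])
```

With a fixed step size and a convex quadratic, the iterates contract geometrically toward the minimizer, so 100 steps already land very close to `(3, -1)`.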

This is how it is mostly done in neural networks, and the method even has a name: steepest descent. But why? In the case of a convex function, we could just compute all the partial derivatives, set them to zero, and solve for the roots; this gives the critical point where the function attains its minimum or maximum, depending on whether it is convex or concave. Why bother with gradient descent?

NB. Not sure if this question belongs in here though.


2 Answers

  1. Even for convex functions, the equation $\nabla f(x) = 0$ does not, in most cases of interest, have a closed-form solution. Since you cannot solve it directly, you must use an iterative scheme (like gradient descent).
  2. The same idea holds for neural networks, except that the function is, in most cases of interest, not convex, so gradient descent has a very low chance of producing the global minimum.
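Point 1 can be made concrete with logistic regression: the log-loss is convex, but setting its gradient to zero yields a transcendental equation with no closed-form solution, so an iterative scheme is the practical way to minimize it. A minimal sketch, where the tiny (deliberately non-separable) data set and the learning rate are assumptions for illustration:

```python
import numpy as np

# Toy 1-D logistic regression on non-separable data.
# The first-order condition sum((sigmoid(w * x_i) - y_i) * x_i) = 0
# mixes w with sigmoid(w * x_i), so it cannot be solved for w in
# closed form -- we approximate the root iteratively instead.
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 0.0, 1.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(w):
    """Gradient of the logistic log-loss with respect to the weight w."""
    return np.sum((sigmoid(w * X) - y) * X)

w = 0.0
for _ in range(500):
    w -= 0.1 * grad(w)  # gradient descent instead of solving grad(w) = 0
```

Because the loss is convex and smooth, the iterations drive `grad(w)` toward zero, which is exactly what a closed-form solution would have delivered in one step if it existed.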

I was wondering the same thing: why don't we apply what we learned in calculus and just set the partial derivatives to zero to find the minimum?

I think the main difference is:

  • Setting the partial derivatives to zero would find the exact minimum for your training examples. But once you have fit your parameters to the training set that precisely, you have overfit them, and there is no guarantee they will work on new data.
  • Gradient descent lets us reduce the error of the cost function without removing it completely, which is actually good: you want to leave your parameters some room to maneuver so they work on new data.

So I think in machine learning we are not trying to find the exact parameters that drive the cost function to zero; we want it to be close to zero. Setting the partial derivatives to zero does solve the problem exactly, but only for the training data set.