Why is gradient descent used?


It is clear to me how gradient descent works: we compute the gradient, the vector of first-order partial derivatives, which points in the direction of the fastest growth of the function, and by following it in the reverse direction we approach a minimum (the global minimum if the function is convex).
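The descent loop described above can be sketched in a few lines of Python. The quadratic test function, step size, and iteration count here are illustrative choices, not anything canonical:

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step in the direction opposite the gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Example: f(x, y) = (x - 3)^2 + (y + 1)^2, whose gradient is
# (2(x - 3), 2(y + 1)) and whose unique minimum is at (3, -1).
grad_f = lambda x: 2 * (x - np.array([3.0, -1.0]))
x_min = gradient_descent(grad_f, x0=[0.0, 0.0])
```

With a fixed step size and a convex quadratic, the iterates contract geometrically toward the minimizer, so 100 steps already land very close to `(3, -1)`.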

This is how it is mostly done in neural networks, and the method even has a name: steepest descent. But why? In the case of a convex function, we could just compute all the partial derivatives, set them to zero, and solve for the roots; this gives the critical point where the function attains its minimum or maximum, depending on whether it is convex or concave. Why bother with gradient descent?

NB. Not sure if this question belongs in here though.


2 Answers

  1. Even for convex functions, the equation $\nabla f(x) = 0$ does not, in most cases of interest, have a closed-form solution. Since you cannot solve it directly, you must use an iterative scheme (like gradient descent).
  2. The same idea holds for neural networks, except that the function is, in most cases of interest, not convex, so gradient descent has a very low chance of producing the global minimum.
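Point 1 can be made concrete with logistic regression: the log-loss is convex, but setting its gradient to zero yields a transcendental equation with no closed-form solution, so an iterative scheme is the practical way to minimize it. A minimal sketch, where the tiny (deliberately non-separable) data set and the learning rate are assumptions for illustration:

```python
import numpy as np

# Toy 1-D logistic regression on non-separable data.
# The first-order condition sum((sigmoid(w * x_i) - y_i) * x_i) = 0
# mixes w with sigmoid(w * x_i), so it cannot be solved for w in
# closed form -- we approximate the root iteratively instead.
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 0.0, 1.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(w):
    """Gradient of the logistic log-loss with respect to the weight w."""
    return np.sum((sigmoid(w * X) - y) * X)

w = 0.0
for _ in range(500):
    w -= 0.1 * grad(w)  # gradient descent instead of solving grad(w) = 0
```

Because the loss is convex and smooth, the iterations drive `grad(w)` toward zero, which is exactly what a closed-form solution would have delivered in one step if it existed.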

I was wondering the same thing: why don't we apply what we learned in calculus and just set the partial derivatives to zero to find the minimum?

I think the main difference is:

  • Setting the partial derivatives to zero would find the exact minimum for your training examples. But once you have fit your parameters to the training set that precisely, you have overfit them, and there is no guarantee they will work on new data.
  • Gradient descent lets us reduce the error of the cost function without removing it completely, which is actually good: you want to leave your parameters some room to maneuver so they work on new data.

So I think in machine learning we are not trying to find the exact parameters that drive the cost function to zero; we want it to be close to zero. Setting the partial derivatives to zero does solve the problem exactly, but only for the training data set.