It is clear to me how gradient descent works: we compute the partial derivatives with respect to every parameter, which gives a vector pointing in the direction of the fastest growth of the function, and by following it in the reverse direction we approach a minimum (the global one if the function is convex, otherwise only a local one).
This is how it is mostly done in neural networks, and the method even has a name: steepest descent. But why? In the case of a convex function, we can just compute all the partial derivatives, set them to zero, and solve for the root. That gives the critical point, which is a minimum or a maximum depending on whether the function is convex or concave there. Why bother with gradient descent at all?
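To make the contrast concrete, here is a toy sketch in NumPy (the least-squares objective and all names are my own illustration, not anything from a specific library): for this convex problem, "set the gradient to zero" means solving the normal equations directly, while gradient descent reaches the same answer iteratively.

```python
import numpy as np

# Toy convex objective: f(w) = ||X w - y||^2  (ordinary least squares)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

# Analytic route: grad f(w) = 2 X^T (X w - y) = 0  =>  X^T X w = X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Iterative route: plain gradient descent on the same objective
w = np.zeros(3)
lr = 1e-3  # step size chosen small enough for this problem's curvature
for _ in range(2000):
    grad = 2 * X.T @ (X @ w - y)  # gradient of ||X w - y||^2
    w -= lr * grad

print(w_closed)  # exact minimizer from the normal equations
print(w)         # gradient descent converges to the same point
```

For three parameters the direct solve is obviously preferable; the question is why that stops being the case at neural-network scale.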
NB. Not sure if this question belongs here, though.