Why do we use SGD and not a root-finding numerical method to solve the gradient-equals-zero equation for convex functions?


Let's suppose we have a convex, differentiable function $F(x)$, where $x\in\mathbb{R}^D$. Since $F$ is convex, the global minimum is found by solving the system of equations $\nabla F(x)=0$.

My question is, numerically, why don't we find the roots of $\nabla F(x)=0$ by means of a numerical method like Newton-Raphson instead of updating with SGD and variants, i.e.,

$x^{k}=x^{k-1}-\alpha\nabla F(x^{k-1})$.

Is there any advantage?
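For concreteness, the update above can be sketched in code. This is a minimal illustration, not part of the question as asked; the quadratic objective $F(x)=\tfrac12\|Ax-b\|^2$, the data, and the step size $\alpha$ are my own choices:

```python
import numpy as np

# Convex example: F(x) = 0.5 * ||A x - b||^2, with gradient A^T (A x - b).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
b = rng.standard_normal(50)

def grad_F(x):
    return A.T @ (A @ x - b)

alpha = 0.01           # fixed step size, chosen below 2 / lambda_max(A^T A)
x = np.zeros(5)
for _ in range(2000):  # x^k = x^{k-1} - alpha * grad F(x^{k-1})
    x = x - alpha * grad_F(x)

# Compare with the exact least-squares minimizer.
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x - x_star))
```

For this well-conditioned quadratic the iterates contract toward the minimizer at a linear rate, so after a couple of thousand steps the distance to `x_star` is negligible.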

BEST ANSWER

why don't we find the roots of $\nabla F(x)=0$ by means of a numerical method like Newton-Raphson instead of updating with SGD and variants?

Is there such a thing? You could apply any root-finding method to compute a minimizer. A few points, though.

Usually, high precision on the root is not needed to obtain a good candidate minimizer: whenever $\|\nabla F(x)\| \leq \epsilon$, $x$ is already a good solution. Moreover, the Hessian, say $\nabla^{2} F$, does not necessarily have a good condition number around the solution. For example, the function $g:\mathbb{R}\rightarrow \mathbb{R}$ such that $g(0)=0$ and $g(x) = x^2\exp(-1/x^2)$ for all other $x \in \mathbb{R}$ is convex, but its second derivative vanishes to all orders at the minimizer, so Newton's method behaves badly on that problem. Gradient-based methods, on the other hand, equipped with an Armijo-type line search, may take larger steps toward the minimum than Newton's method does.

Also, stochastic gradient descent does not require storing a large matrix such as the Hessian, or even the full gradient of the function. When the objective is a sum of simple convex functions, as in the least-squares problem, SGD computes the gradient of only one of those simple functions per iteration, and yet a good solution is typically found within a small number of epochs (usually no more than ten). Newton's method, by contrast, requires computing the gradients of all the simple functions, summing them, and then solving a linear system; each Newton step costs at least as much as one full epoch.

In summary, if you do not need high precision (or you actually want low precision, as in machine learning, to avoid overfitting the data) and your objective function is a sum of simple convex functions, SGD is a good algorithm for your problem.