Difference between gradient descent and finding stationary points with calculus?


Take the example of trying to fit a regression line

$$ y = b + mx $$

to some data.

Method #1

You can take the partial derivatives of the error function

$$ z = \frac{1}{2} \sum_i \big(f(x_i) - y_i\big)^2, \qquad f(x) = b + mx, $$

with respect to $b$ and $m$, set them to zero, and solve for the stationary points.
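A minimal sketch of Method 1 in Python (the function name and sample data are my own, chosen for illustration): setting $\partial z/\partial b = 0$ and $\partial z/\partial m = 0$ gives two linear "normal equations", which have the familiar closed-form solution below.

```python
def fit_line(xs, ys):
    """Return (b, m) minimizing (1/2) * sum((b + m*x - y)**2)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Solution of the 2x2 normal equations:
    #   n*b  + sx*m  = sy
    #   sx*b + sxx*m = sxy
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - m * sx) / n
    return b, m

b, m = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
# The data lie exactly on y = 1 + 2x, so b = 1 and m = 2.
```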

Method #2

Use the gradient descent algorithm to find a local minimum.
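For comparison, a minimal sketch of Method 2 on the same objective (the learning rate and step count are illustrative choices, not part of the question): repeatedly step $b$ and $m$ against the partial derivatives of $z$.

```python
def gradient_descent(xs, ys, lr=0.01, steps=5000):
    b = m = 0.0
    for _ in range(steps):
        # Partial derivatives of z = (1/2) * sum((b + m*x - y)**2)
        db = sum((b + m * x - y) for x, y in zip(xs, ys))
        dm = sum((b + m * x - y) * x for x, y in zip(xs, ys))
        b -= lr * db
        m -= lr * dm
    return b, m

b, m = gradient_descent([0, 1, 2, 3], [1, 3, 5, 7])
# Converges toward the same solution as the closed form: b = 1, m = 2.
```

Note that unlike the closed form, this only approaches the minimum, and the quality of the answer depends on the learning rate and number of iterations.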

Question:

Method 1 seems superior. Why couldn't you find all the stationary points along the $b$ and $m$ axes ($z$ being the error, plotted vertically) and just pick the values of $b$ and $m$ that minimize it? Why can't you do this and avoid the iterative process of gradient descent?

Where is my intuition going wrong?


Best answer

Because the objective function (the sum of squared errors over the data points) is exactly a quadratic, gradient descent behaves very well: the gradient always points downhill toward the unique minimum, and with an exact line search it converges rapidly (in a single step when the level sets are circular; more generally, a related method like conjugate gradient finds the exact minimum of an $n$-dimensional quadratic in at most $n$ steps).
This is true not only for a one-dimensional line, but for any multivariate linear fit.

The calculations needed to do the gradient descent are precisely the same as those needed to solve the simultaneous equations. And indeed, the practical person would use Method 1.

However, if your objective function is not a perfect quadratic, two things happen. Method 1 becomes impractical, since you generally can't solve the simultaneous non-linear equations in closed form, and gradient descent will choose a somewhat inferior initial direction, so that multiple iterations are needed. Here the practical person is forced to use Method 2 (or better, some method like conjugate gradient that deals with issues like slowly spiralling in to the solution, which often happens in naive steepest descent).
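A small sketch of the non-quadratic situation described above, using a one-dimensional objective of my own choosing: for $g(w) = e^w + w^2$, the stationarity condition $g'(w) = e^w + 2w = 0$ has no elementary closed-form solution, so the "solve the equations" route is unavailable, while gradient descent still works.

```python
import math

def descend(g_prime, w=0.0, lr=0.1, steps=200):
    # Plain gradient descent on a 1-D objective with derivative g_prime.
    for _ in range(steps):
        w -= lr * g_prime(w)
    return w

# g(w) = exp(w) + w**2 is convex but not quadratic;
# its minimizer satisfies exp(w) + 2w = 0 (near w = -0.35).
w = descend(lambda w: math.exp(w) + 2 * w)
```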

Another answer

There is no need for gradient descent in this case. Method 1 is great. You set the gradient equal to $0$ and solve for $m$ and $b$.