I had a few questions about the linear regression derivation.
$$\mathrm{SSE}=\sum_{i=1}^{N}(y_i-b_0-b_1x_i)^2$$
In the above example, I simply found the values $b_0$ and $b_1$ that minimize the SSE by setting the partial derivatives with respect to $b_0$ and $b_1$ to zero. I had a few questions about this:
I know (from calculus) that a point where the first derivative with respect to a variable vanishes could be a minimum or a maximum. In most linear regression examples I have seen, the authors assume that this critical point minimizes the error function; I never saw them take the second derivative to confirm. Is there a reason why, or are those examples just incomplete?
Using gradient descent we can find the minimum of a function step by step. Why do we need gradient descent if I can just do what I did for linear regression (i.e., find the partial derivatives and solve for the answers)? Could someone cite some examples (hopefully with links) where this won't work and we will need gradient descent?
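To make concrete what I did (a quick sketch of my own; the data are made up), setting the partial derivatives of SSE to zero gives the familiar closed-form solution:

```python
# Closed-form simple linear regression obtained by setting
# dSSE/db0 = dSSE/db1 = 0 (toy data, made up for illustration):
#   b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
#   b0 = ybar - b1 * xbar
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.2, 5.8, 8.1, 9.9])

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar
```

No iteration is needed: the answer comes straight from the formulas.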
Thanks
I think that stochasticboy321 made a very good comment on the question.
When you minimize the sum of squares for a function which is linear with respect to its parameters, the problem is simple and the so-called normal equations are particularly easy to solve (in particular using matrix calculations).
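For instance (a quick sketch of my own, with made-up data), the normal equations $X^\top X\,b = X^\top y$ can be solved directly in matrix form, and the same code works for any number of predictors:

```python
# Solving the normal equations for simple linear regression
# (toy data, made up for illustration).
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

# Design matrix: a column of ones for the intercept b0, then x for b1.
X = np.column_stack([np.ones_like(x), x])

# Normal equations: (X^T X) b = X^T y
b = np.linalg.solve(X.T @ X, X.T @ y)
b0, b1 = b
```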
The problem becomes more difficult when the function is not linear with respect to its parameters. Suppose that we have $n$ data points $(x_i,y_i)$ and we want to fit the simple model $$y=a\times b^x$$
Defining $z_i=\log(y_i)$ and taking logarithms, the model writes $$z_i=\log(a)+x_i\log(b)=A+ B\, x_i$$ and standard linear regression provides the values of $A,B$ which, in turn, give estimates $a=e^A$, $b=e^B$. These are only estimates because $A,B$ are computed by minimizing the sum of squared errors on $z$, that is to say $$E_1=\sum_{i=1}^n (A+ B x_i-\log(y_i))^2$$ while what is measured is $y$ and what has to be minimized is $$E_2=\sum_{i=1}^n (a\times b^{x_i}-y_i)^2$$ which is much more complex and requires either an optimizer (different methods exist, among them gradient descent) or the Newton-Raphson method to solve $$\frac{\partial E_2}{\partial a}=\frac{\partial E_2}{\partial b}=0.$$ But the linearized model, even if it minimizes the wrong criterion, provides reasonable and consistent starting values which, in any case, you need to supply to the optimizer.
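As a quick illustration (my own sketch; the data, noise level, learning rate, and iteration count are all made up), one can take the starting values from the linearized fit and then refine them by gradient descent on $E_2$:

```python
# Fit y = a * b**x: linearize with logs for starting values,
# then refine by gradient descent on the true squared error E2.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 3.0, 20)
y = 2.0 * 1.5**x + rng.normal(0.0, 0.05, size=x.size)  # toy data, true a=2, b=1.5

# Step 1: linear regression on z = log(y) gives A = log(a), B = log(b).
z = np.log(y)
B, A = np.polyfit(x, z, 1)          # slope first, then intercept
a, b = np.exp(A), np.exp(B)

# Step 2: gradient descent on E2 = sum (a*b**x - y)^2,
# starting from the linearized estimates (a, b).
lr = 1e-4
for _ in range(20000):
    r = a * b**x - y                              # residuals
    grad_a = 2.0 * np.sum(r * b**x)               # dE2/da
    grad_b = 2.0 * np.sum(r * a * x * b**(x - 1)) # dE2/db
    a -= lr * grad_a
    b -= lr * grad_b
```

Without the starting values from step 1, the descent in step 2 could converge slowly or to a poor local solution.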
Edit
I suggest you have a look here; it illustrates the problem for the fit of the logistic function and contains a worked example.