In this Khan Academy video series Khan goes through the derivation of the formula for the linear regression line for some data points.
The only part I do not understand is the one I've given a link to. Particularly, I don't understand why Khan is so sure that when he sets the partial derivatives to zero, he is going to get the squared error function at its minimum (as opposed to its maximum).
How does he know that? He doesn't explain this in the video, so I believe it must be more or less obvious.
A short answer explaining this in simple terms would be much appreciated.
That the critical point corresponds to a minimum of the squared error is intuitively plausible, but it is not entirely trivial to prove in general.
In the simplest case the crucial fact is that we are minimizing a function of two variables:
$$e=f(m,b)=\sum (mx_i+b-y_i)^2$$
and at the critical point ($\nabla f=0$) we must also verify that $f_{mm}>0$ and that the determinant of the Hessian matrix is positive:
$$\begin{vmatrix} f_{mm}&f_{mb}\\f_{mb}&f_{bb} \end{vmatrix}=f_{mm}f_{bb}-(f_{mb})^2>0$$
The positivity of this determinant can be shown by induction or by the Cauchy–Schwarz inequality.
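To sketch the Cauchy–Schwarz route: the second partials of $f$ are constant (here $n$ is the number of data points), so the determinant can be computed once and for all:
$$f_{mm}=2\sum x_i^2,\qquad f_{bb}=2n,\qquad f_{mb}=2\sum x_i,$$
hence
$$f_{mm}f_{bb}-(f_{mb})^2=4\left(n\sum x_i^2-\Big(\sum x_i\Big)^2\right)>0,$$
where the inequality is Cauchy–Schwarz applied to the vectors $(x_1,\dots,x_n)$ and $(1,\dots,1)$, and it is strict as long as the $x_i$ are not all equal. Since also $f_{mm}>0$, the critical point is a minimum.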
Here is a nice derivation: A Quick Proof that the Least Squares Formulas Give a Local Minimum.
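If you want to see the claim numerically, here is a small sketch (the data points are made up for illustration): it solves the normal equations obtained by setting the gradient to zero, then checks that the Hessian determinant is positive and that perturbing $(m,b)$ only increases the squared error.

```python
import numpy as np

# Illustrative data points (any x_i that are not all equal will do)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 2.9, 4.2, 4.8])
n = len(x)

# Normal equations: gradient of f(m, b) = sum (m x_i + b - y_i)^2 set to zero
A = np.array([[np.sum(x**2), np.sum(x)],
              [np.sum(x),    n        ]])
rhs = np.array([np.sum(x * y), np.sum(y)])
m, b = np.linalg.solve(A, rhs)

# The Hessian of f is constant: f_mm = 2*sum(x^2), f_mb = 2*sum(x), f_bb = 2n
H = 2 * A
det_H = np.linalg.det(H)

def sse(m, b):
    """Squared error for slope m and intercept b."""
    return np.sum((m * x + b - y)**2)

# Second-derivative test: det(H) > 0 and f_mm > 0, so (m, b) is a minimum
assert det_H > 0 and H[0, 0] > 0

# Sanity check: moving away from the critical point increases the error
assert sse(m, b) < sse(m + 0.1, b)
assert sse(m, b) < sse(m, b + 0.1)
print(m, b, det_H)
```

Since the Hessian does not depend on $(m,b)$ at all, $f$ is convex on the whole plane, which is why the unique critical point is a global (not just local) minimum.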