I have been trying to follow the derivation of the normal equations, but there is one part I do not understand.
So, if we minimize
$L(\mathbf{b})=\mathbf{y}^T\mathbf{y}-(2\mathbf{y}^T\mathbf{X})\mathbf{b}+\mathbf{b}^T(\mathbf{X}^T\mathbf{X})\mathbf{b}$
then $\frac{\partial L(\mathbf{b})}{\partial \mathbf{b}}= \mathbf{0}-2\mathbf{X}^T\mathbf{y}+2(\mathbf{X}^T\mathbf{X})\mathbf{b}$.
I would have thought $(2\mathbf{y}^T\mathbf{X})\mathbf{b}$ simply becomes $(2\mathbf{y}^T\mathbf{X})$. But apparently it does not, and I cannot find the full derivation anywhere. I'd be very grateful for an explanation.
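There are two ways of writing it, and in either one you must be consistent about where the index of your derivative goes. Working in components, with repeated indices summed over:

$$\begin{aligned}\frac{\partial L}{\partial b_p}&=\frac{\partial}{\partial b_p}\left( y_jy_j-2y_iX_{ij}b_j+b_i X_{ki}X_{kj}b_j \right)\\&=0-2y_iX_{ip}+X_{kp}X_{kj}b_j+b_iX_{ki}X_{kp}\\&=-2y_iX_{ip}+2X_{ki}X_{kp}b_i\end{aligned}$$

In the last step the dummy index $j$ is relabelled as $i$, which shows that the two quadratic terms are equal.

This can be written in one of two ways: $\left[-2y^TX+2b^TX^TX \right]_p$ or $\left[-2X^Ty+2X^TXb \right]_p$. The former is a row vector, the latter a column vector. The components are the same either way: $2y_iX_{ip}$ is the $p$-th entry of both $2y^TX$ and $2X^Ty$, so your $2\mathbf{y}^T\mathbf{X}$ is the same gradient written as a row. You probably want your answer to be a column vector, so you go for

$$\frac{\partial L}{\partial \vec b}=-2X^T\vec y+2X^TX\vec b$$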
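If it helps to see the result confirmed numerically, here is a minimal sketch that compares the column-vector formula against central finite differences. The random `X`, `y`, `b`, their sizes, and the helper names (`loss`, `analytic_grad`, `numeric_grad`) are all made up for illustration; only the formulas come from the derivation above.

```python
import numpy as np

# Arbitrary illustrative data: 6 observations, 3 coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
y = rng.normal(size=6)
b = rng.normal(size=3)

def loss(b):
    # L(b) = y^T y - (2 y^T X) b + b^T (X^T X) b
    return y @ y - 2 * (y @ X) @ b + b @ (X.T @ X) @ b

def analytic_grad(b):
    # Column-vector form: -2 X^T y + 2 X^T X b
    return -2 * X.T @ y + 2 * (X.T @ X) @ b

def numeric_grad(f, b, h=1e-6):
    # Central difference in each coordinate b_p
    g = np.zeros_like(b)
    for p in range(b.size):
        e = np.zeros_like(b)
        e[p] = h
        g[p] = (f(b + e) - f(b - e)) / (2 * h)
    return g

grad_a = analytic_grad(b)
grad_n = numeric_grad(loss, b)
max_diff = np.max(np.abs(grad_a - grad_n))
print("largest discrepancy:", max_diff)

# The row form y^T X and the column form X^T y have the same entries.
print(np.allclose(y @ X, X.T @ y))
```

NumPy's 1-D arrays do not distinguish rows from columns, which is why `y @ X` and `X.T @ y` compare equal here. The distinction only matters once you fix a layout convention on paper.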