In the book I am studying, the author shows that the sum of squared distances of the data points from the fitted line can be written in matrix form as $$ (t-X\beta)^T(t-X\beta) $$ where $X$ is a matrix with one observation per row, $t$ is the column vector of corresponding target values, and $\beta$ is the column vector of parameters we are estimating.
So far, so good. Then, to minimize this sum, we take the derivative with respect to $\beta$ and set it to zero.
We get $$ X^T(t-X\beta)=0 $$ and that is too big a jump for me. I know basic algebra, but not matrix calculus. Can you detail the steps from the first equation to the second?
I can go this far: $$ (t-X\beta)^T(t-X\beta) = (t^T-\beta^TX^T)(t-X\beta) = t^Tt-t^TX\beta-\beta^TX^Tt+\beta^TX^TX\beta $$
But I do not know how to take the derivative of the last line w.r.t. $\beta$.
Thanks.
Here I list a few basic rules of matrix calculus that apply to a large class of vector derivatives and can also be readily proved, as in coppper.hat's method. Assume $A$ is a constant matrix and $x$ is a vector; then the following hold:
$$\begin{align}\nabla_xAx&=A\\\nabla_xx^TA&=A^T\\\nabla_xx^TAx&=x^TA+(Ax)^T=x^T(A+A^T)\qquad\text{(product rule)}\end{align}$$
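These rules use the convention that the gradient of a scalar function is a row vector of partials. As a sanity check, the quadratic-form rule can be verified numerically with finite differences (a sketch using NumPy; the data and helper `num_grad` are my own, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))   # constant matrix A
x = rng.normal(size=n)        # point at which we check the gradient

def num_grad(f, x, h=1e-6):
    """Central-difference gradient of a scalar function f, as a row vector."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros(len(x)); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# Rule: grad_x (x^T A x) = x^T (A + A^T)
g_num = num_grad(lambda v: v @ A @ v, x)
g_ana = x @ (A + A.T)
assert np.allclose(g_num, g_ana, atol=1e-5)
```

The first two rules are Jacobians of linear maps and follow directly from component-wise differentiation; the assertion above checks the product-rule case.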
Then for your question, the deduction is straightforward: $$\begin{align}&\nabla_\beta(t^Tt-t^TX\beta-\beta^TX^Tt+\beta^TX^TX\beta)\\=&-t^TX-(X^Tt)^T+\beta^T(X^TX+(X^TX)^T)\\=&2\beta^TX^TX-2t^TX\\=&2(X\beta-t)^TX.\end{align}$$ Setting this to zero and transposing gives $X^T(X\beta-t)=0$, i.e. $X^T(t-X\beta)=0$.
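You can also confirm both the gradient formula and the final normal-equations identity numerically (a sketch; the random data here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))   # one observation per row
t = rng.normal(size=20)        # target values
b = rng.normal(size=3)         # an arbitrary beta

sse = lambda b: (t - X @ b) @ (t - X @ b)   # (t - X b)^T (t - X b)

# Central-difference gradient at b matches 2 (X b - t)^T X
h = 1e-6
g_num = np.array([(sse(b + h * e) - sse(b - h * e)) / (2 * h)
                  for e in np.eye(3)])
g_ana = 2 * (X @ b - t) @ X
assert np.allclose(g_num, g_ana, atol=1e-4)

# At the least-squares solution, the gradient vanishes: X^T (t - X beta) = 0
beta_hat, *_ = np.linalg.lstsq(X, t, rcond=None)
assert np.allclose(X.T @ (t - X @ beta_hat), 0, atol=1e-7)
```

The second assertion is exactly the condition $X^T(t-X\beta)=0$ from the book, evaluated at the minimizer returned by `np.linalg.lstsq`.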