The residual sum of squares (RSS) is defined as
$$\mathrm{RSS}(\beta) = (y - X\beta)^T (y - X\beta)$$
While differentiating $\mathrm{RSS}(\beta)$ with respect to $\beta$ to find the minimizer, the author reaches the conclusion that
$$X^T(y - X\beta) = 0$$
where $X$ is an $N \times p$ matrix, $y$ is an $N \times 1$ vector, and $\beta$ is a $p \times 1$ vector.
Can someone please point out how this conclusion was reached?
Note: This is from the book *The Elements of Statistical Learning*.
Write the RSS in terms of the Frobenius ($:$) product, then find its differential and gradient with respect to $\beta$:
$$\eqalign{ R &= (X\beta-y):(X\beta-y) \cr\cr dR &= 2\,(X\beta-y):(X\,d\beta) \cr &= 2\,X^T(X\beta-y):d\beta \cr\cr \frac{\partial R}{\partial\beta} &= 2\,X^T(X\beta-y) \cr }$$ Now set the gradient equal to zero, like the author does, and solve for $\beta$.
If you're uncomfortable with the Frobenius products in the above derivation, you can substitute the equivalent trace function, $A:B={\rm tr}(A^TB)$.
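As a numerical sanity check, here is a short NumPy sketch (the random data, dimensions, and variable names are my own, not from the book) that compares the gradient $2X^T(X\beta - y)$ against a finite-difference approximation, and confirms that the $\hat\beta$ solving the normal equations $X^TX\beta = X^Ty$ makes $X^T(y - X\hat\beta)$ vanish:

```python
import numpy as np

# Arbitrary small problem: N observations, p predictors.
rng = np.random.default_rng(0)
N, p = 50, 3
X = rng.normal(size=(N, p))
y = rng.normal(size=N)

def rss(beta):
    """RSS(beta) = (y - X beta)^T (y - X beta)."""
    r = X @ beta - y
    return r @ r

# Analytic gradient at an arbitrary point: 2 X^T (X beta - y).
beta0 = rng.normal(size=p)
grad = 2 * X.T @ (X @ beta0 - y)

# Central finite-difference approximation of the same gradient.
eps = 1e-6
fd = np.array([
    (rss(beta0 + eps * np.eye(p)[j]) - rss(beta0 - eps * np.eye(p)[j])) / (2 * eps)
    for j in range(p)
])
print(np.allclose(grad, fd, atol=1e-4))  # the two gradients agree

# Setting the gradient to zero gives the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The residual is then orthogonal to the columns of X: X^T (y - X beta_hat) = 0.
print(np.allclose(X.T @ (y - X @ beta_hat), 0))
```

Geometrically, $X^T(y - X\hat\beta) = 0$ says the residual vector is orthogonal to the column space of $X$, which is exactly the least-squares projection condition.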