I'm reading chapter 3 of the book "The Elements of Statistical Learning - Data Mining, Inference, and Prediction", and there is a simple derivation that I don't understand:
We have: $$(1): RSS(\beta) = (y-X\beta)^T(y-X\beta)$$
where the input $X$ is an $N \times (p+1)$ matrix whose $p+1$ columns are $1, X_1, X_2, \dots, X_p$, and $y$ is the $N$-vector of outputs.
The question is: when differentiating with respect to $\beta$, why do we obtain the following? $$(2): \frac{\partial RSS}{\partial \beta} = -2X^T(y-X\beta) $$ An answer that explains how to differentiate RSS with respect to $\beta$ would be highly appreciated.
You can multiply out the brackets:
$RSS(\beta) = (y-X\beta)^T(y-X\beta)=(y^T-\beta^TX^T)(y-X\beta)$
$=y^Ty-y^TX\beta-\beta^TX^Ty+\beta^TX^TX\beta$
The second and third summands are scalars, and a scalar equals its own transpose, so $y^TX\beta=(y^TX\beta)^T=\beta^TX^Ty$. Hence
$RSS(\beta)=y^Ty-2\beta^TX^Ty+\beta^TX^TX\beta$
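Now we can differentiate RSS w.r.t. $\beta$. The first summand becomes $0$ because it does not depend on $\beta$. The second summand becomes $-2X^Ty$ and the third summand becomes $2X^TX\beta$ (this uses the fact that $X^TX$ is symmetric). These rules can be derived directly by writing out the components. Thus we get
$\frac{\partial RSS}{\partial \beta}=-2X^Ty+2X^TX\beta=-2X^T(y-X\beta)$
which is $(2)$. Setting this derivative equal to $0$ gives the normal equations $X^TX\beta=X^Ty$.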
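As a sanity check, the gradient formula can be compared against finite differences. The following is a minimal sketch in plain Python. The helper names (`rss`, `rss_gradient`, `numerical_gradient`), the random data, and the step size `h` are illustrative assumptions and do not come from the book.

```python
import random


def residuals(X, y, beta):
    """Return the residual vector y - X beta as a list."""
    return [yi - sum(xij * bj for xij, bj in zip(row, beta))
            for row, yi in zip(X, y)]


def rss(X, y, beta):
    """Residual sum of squares (y - X beta)^T (y - X beta)."""
    return sum(r * r for r in residuals(X, y, beta))


def rss_gradient(X, y, beta):
    """Analytic gradient -2 X^T (y - X beta)."""
    res = residuals(X, y, beta)
    return [-2.0 * sum(row[j] * r for row, r in zip(X, res))
            for j in range(len(beta))]


def numerical_gradient(X, y, beta, h=1e-6):
    """Central finite differences, one coordinate of beta at a time."""
    grad = []
    for j in range(len(beta)):
        up = list(beta)
        down = list(beta)
        up[j] += h
        down[j] -= h
        grad.append((rss(X, y, up) - rss(X, y, down)) / (2 * h))
    return grad


# Hypothetical data: N observations, p inputs, plus a leading column of ones.
random.seed(0)
N, p = 6, 2
X = [[1.0] + [random.gauss(0, 1) for _ in range(p)] for _ in range(N)]
y = [random.gauss(0, 1) for _ in range(N)]
beta = [random.gauss(0, 1) for _ in range(p + 1)]

analytic = rss_gradient(X, y, beta)
numeric = numerical_gradient(X, y, beta)
max_abs_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print("largest difference between analytic and numerical gradient:", max_abs_diff)
```

Because RSS is quadratic in $\beta$, central differences should agree with the analytic gradient up to rounding error, so the printed difference is expected to be very small.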
Update:
You can substitute $X^Ty$ by $Z$. I omit the constant factor $-2$. Then we have
$$\frac{\partial\beta^T \cdot Z }{\partial \beta}$$
We can use the property $$\frac{\partial \beta^TZ}{\partial \beta}=\left(\frac{\partial (\beta^TZ)^T}{\partial \beta^T}\right)^T=\left(\frac{\partial Z^T\beta}{\partial \beta^T}\right)^T \qquad (*)$$
An example illustrates this. Here $Z$ is taken to be a $2\times 2$ matrix; the vector case $Z=X^Ty$ is the same computation with a single column.
$Z=\left(\begin{array}{cc} z_{11}& z_{12}\\z_{21}&z_{22} \end{array}\right),\beta =\left(\begin{array}{c} \beta_1 \\\beta_2 \end{array}\right)$
$Z^T\cdot \beta=\left(\begin{array}{cc} z_{11}& z_{21}\\z_{12}&z_{22} \end{array}\right)\cdot \left(\begin{array}{c} \beta_1 \\\beta_2 \end{array}\right)=\left(\begin{array}{c} z_{11}\beta_1+ z_{21}\beta_2\\z_{12}\beta_1+z_{22}\beta_2 \end{array}\right)$
$$\frac{\partial Z^T\cdot \beta}{\partial \beta^T}=\frac{\partial \left(\begin{array}{c} z_{11}\beta_1+ z_{21}\beta_2\\z_{12}\beta_1+z_{22}\beta_2 \end{array}\right)}{\partial \left(\begin{array}{cc} \beta_1 &\beta_2 \end{array}\right)}=\left(\begin{array}{cc} z_{11}& z_{21}\\z_{12}&z_{22}\end{array}\right)=Z^T$$
With $(*)$ we see that $$\frac{\partial\beta^T \cdot Z }{\partial \beta}=\left(Z^T\right)^T=Z=X^Ty$$
Restoring the omitted constant factor gives $-2X^Ty$ for the second summand.
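The same rules can also be checked component by component, which covers the third summand as well. A sketch, taking $Z=X^Ty$ as a vector and writing $A=X^TX$ (a shorthand introduced here for illustration), with $A$ symmetric:

$$\frac{\partial}{\partial \beta_k}\sum_{i}\beta_i Z_i = Z_k, \qquad \frac{\partial}{\partial \beta_k}\sum_{i}\sum_{j}\beta_i A_{ij}\beta_j=\sum_{j}A_{kj}\beta_j+\sum_{i}\beta_i A_{ik}=2\,(A\beta)_k$$

Stacking these over $k$ gives $\frac{\partial \beta^TZ}{\partial \beta}=Z$ and $\frac{\partial \beta^TA\beta}{\partial \beta}=2A\beta=2X^TX\beta$.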