I am reading Elements of Statistical Learning, and they derive the equation for linear regression using differentiation on a function which takes a matrix as input. This is making my head spin a bit and I am looking for some resources to practice this type of differentiation, and become more comfortable with it.
This is the problem. Let $X \in \mathbb{R}^{N \times p}$, $b \in \mathbb{R}^{p \times k}$, and $y \in \mathbb{R}^{N \times k}$.
Define $RSS(X, y) = \operatorname{tr}[(Xb - y)^t(Xb - y)]$.
I want to differentiate $RSS$ with respect to $b$ and set the derivative to zero since I want to minimize the function, $RSS$.
The book claims this is $X^t(Xb - y) = 0$, which gets us $b = (X^tX)^{-1}X^ty$ if $X^tX$ is nonsingular. It does not show how it applied the derivatives (which is expected since that's not what this book is about).
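I did verify the claimed solution numerically, so at least I trust the formula; a quick sketch in plain Python (toy data and ad-hoc helper code of my own, nothing from the book):

```python
# Check numerically that b = (X^T X)^{-1} X^T y makes X^T (Xb - y) vanish.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # N = 3, p = 2
y = [[1.0], [2.0], [4.0]]                  # k = 1

# Form X^T X and X^T y.
Xt = list(zip(*X))
XtX = [[sum(a * b for a, b in zip(r, c)) for c in Xt] for r in Xt]
Xty = [[sum(a * b[0] for a, b in zip(r, y))] for r in Xt]

# Invert the 2x2 matrix X^T X directly.
(a, b_), (c, d) = XtX
det = a * d - b_ * c
inv = [[d / det, -b_ / det], [-c / det, a / det]]

# b = (X^T X)^{-1} X^T y.
beta = [[sum(inv[i][j] * Xty[j][0] for j in range(2))] for i in range(2)]

# The gradient condition X^T (X beta - y) should be (numerically) zero.
Xbeta = [[sum(row[j] * beta[j][0] for j in range(2))] for row in X]
resid = [[xb[0] - yy[0]] for xb, yy in zip(Xbeta, y)]
grad = [sum(Xt[i][n] * resid[n][0] for n in range(3)) for i in range(2)]

print(all(abs(g) < 1e-9 for g in grad))
```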
I can sort of reason that we have some sort of product rule happening. It seems likely that $d/db(\operatorname{tr}(A)) = \operatorname{tr}(d/db(A))$. But right now I am just writing down symbols that somehow represent some ideas in my head.
In my Principles of Mathematical Analysis book (by Rudin) he covers the case where the input is a vector and the output is a vector. Here, however, the input is a matrix, and in some of the subproblems the output is a matrix as well.
Any recommendations for books I can look at?
Generally I find it easier to think of derivatives in terms of 'perturbations' rather than as a coordinate wise thing.
So, think of finding the derivative as gathering the $h$ terms in the expansion $\phi(b+h)-\phi(b)$.
In this case we have $\phi(b) = \operatorname{tr}((Xb-y)^T (Xb-y))$, so $\phi(b+h) = \operatorname{tr}((b+h)^T X^T X(b+h)- 2 y^T X(b+h)+ y^Ty)$ and so we have $\phi(b+h)-\phi(b)= \operatorname{tr}(2b^T X^T Xh- 2 y^T Xh + h^TX^TXh)$.
Gathering the terms in $h$ we see that $D\phi(b)(h) = \operatorname{tr}(2b^T X^T Xh- 2 y^T Xh) = 2\operatorname{tr}((b^T X^T X- y^T X)h)$.
(Note that I have surreptitiously used the following result above: Since $\operatorname{tr} Z = \operatorname{tr} Z^T$, we have $\operatorname{tr} (Z+Z^T) = \operatorname{tr} (2Z)$.)
If we have $D\phi(b)(h) = 0$ for all $h$, we must have $b^T X^T X- y^T X = 0$, or equivalently $X^T(Xb-y) = 0$.
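You can sanity-check this derivative with a finite-difference comparison; here is a small sketch in plain Python (toy matrices and ad-hoc helper functions of mine, nothing canonical):

```python
# Compare D phi(b)(h) = 2 tr((b^T X^T X - y^T X) h) against the
# finite difference phi(b + h) - phi(b) for a small perturbation h.

def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def phi(X, b, y):
    # phi(b) = tr((Xb - y)^T (Xb - y))
    r = sub(matmul(X, b), y)
    return trace(matmul(transpose(r), r))

# A toy instance with N = 3, p = 2, k = 1.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = [[1.0], [2.0], [4.0]]
b = [[0.5], [0.3]]
h = [[1e-6], [-2e-6]]   # small perturbation

# The derivative from the calculation above.
XtX = matmul(transpose(X), X)
grad_part = sub(matmul(transpose(b), XtX), matmul(transpose(y), X))
Dphi = 2 * trace(matmul(grad_part, h))

# Finite difference; agrees with D phi(b)(h) up to the O(|h|^2)
# term tr(h^T X^T X h) dropped in the linearization.
b_plus_h = [[bi + hi for bi, hi in zip(rb, rh)] for rb, rh in zip(b, h)]
fd = phi(X, b_plus_h, y) - phi(X, b, y)

print(abs(fd - Dphi) < 1e-9)
```

The leftover term $\operatorname{tr}(h^TX^TXh)$ is quadratic in $h$, which is exactly why it is discarded when reading off the derivative, and why the finite difference matches to second order.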
Aside:
Note that if $L$ is linear (such as trace), then we have $DL(x)(h) = Lh$. Hence $D(L \circ f)(x)(h) = L(Df(x)(h))$, so the main challenge above is determining the derivative of $b \mapsto (Xb-y)^T (Xb-y)$.
Also, note that $\phi(b) = \|Xb-y\|_F^2$, the Frobenius norm squared.