Differentiation with respect to a matrix (residual sum of squares)?


I've never heard of differentiating with respect to a matrix. Let $\mathbf{y}$ be a $N \times 1$ vector, $\mathbf{X}$ be a $N \times p$ matrix, and $\beta$ be a $p \times 1$ vector. Then the residual sum of squares is defined by $$\text{RSS}(\beta) = \left(\mathbf{y}-\mathbf{X}\beta\right)^{T}\left(\mathbf{y}-\mathbf{X}\beta\right)\text{.}$$ The Elements of Statistical Learning, 2nd ed., p. 45, states that when we differentiate this with respect to $\beta$, we get $$\begin{align} &\dfrac{\partial\text{RSS}}{\partial \beta} = -2\mathbf{X}^{T}\left(\mathbf{y}-\mathbf{X}\beta\right) \\ &\dfrac{\partial^2\text{RSS}}{\partial \beta\text{ }\partial \beta^{T}} = 2\mathbf{X}^{T}\mathbf{X}\text{.} \end{align}$$ I mean, I could look at $\mathbf{y}$ and $\mathbf{X}$ as "constants" and $\beta$ as a variable, but it's unclear to me where the $-2$ in $\dfrac{\partial\text{RSS}}{\partial \beta}$ comes from, and why we would use $\beta^T$ for the second partial.
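For what it's worth, the book's two formulas do check out numerically. Below is a quick NumPy sanity check against finite differences; the sizes, seed, and step size are arbitrary choices of mine, and `rss` is just the definition above written as code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 6, 3                        # arbitrary sizes
X = rng.normal(size=(N, p))        # N x p design matrix
y = rng.normal(size=N)             # length-N response vector
beta = rng.normal(size=p)          # length-p coefficient vector

def rss(b):
    """(y - Xb)^T (y - Xb) as a plain scalar."""
    r = y - X @ b
    return r @ r

# The book's claimed gradient and Hessian
grad = -2 * X.T @ (y - X @ beta)
hess = 2 * X.T @ X

# Central finite differences of RSS, one coordinate of beta at a time
eps = 1e-6
fd_grad = np.array([(rss(beta + eps * e) - rss(beta - eps * e)) / (2 * eps)
                    for e in np.eye(p)])

# RSS is quadratic in beta, so the central difference should agree
# with the claimed gradient up to floating-point rounding.
print(np.max(np.abs(fd_grad - grad)))
```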

Any textbooks that cover this topic would be appreciated as well.

Side note: this is not homework. Please note that I graduated with an undergrad degree only, so assume that I've seen undergraduate real analysis, abstract algebra, and linear algebra for my pure mathematics background.

So, what you have here is basically a functional. You input a matrix ($\mathbf{X}$) and a couple of vectors ($\mathbf{y}$ and $\beta$), then combine them in such a way that the output is just a single number. So what we need here is called a functional derivative.

Let $\epsilon > 0$ and $\gamma$ be an arbitrary $p \times 1$ vector, then $$\frac{\partial \text{RSS}}{\partial \beta} \equiv \lim_{\epsilon \to 0} \Big((\epsilon \gamma^T)^{-1}\big(\text{RSS}(\beta + \epsilon \gamma) - \text{RSS}(\beta)\big) \Big). $$

We're adding a small, arbitrary vector to $\beta$ and then seeing how that changes $\text{RSS}$. We 'divide' out this arbitrary vector at the end, and I've used the transpose here because $\beta$ and $\gamma$ enter the original functional as multiplication from the right, so coming from the left we use the transpose. All that is left is to evaluate these expressions.

$$\text{RSS}(\beta+\epsilon\gamma) = \left(\mathbf{y}-\mathbf{X}(\beta+\epsilon\gamma)\right)^{T}\left(\mathbf{y}-\mathbf{X}(\beta+\epsilon\gamma)\right) = \left((\mathbf{y}-\mathbf{X}\beta)^{T}-\epsilon(\mathbf{X}\gamma)^{T}\right)\left((\mathbf{y}-\mathbf{X}\beta)-\epsilon\mathbf{X}\gamma\right) $$ $$= (\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)-\epsilon(\mathbf{y}-\mathbf{X}\beta)^{T}\mathbf{X}\gamma-\epsilon(\mathbf{X}\gamma)^{T}(\mathbf{y}-\mathbf{X}\beta)+\epsilon^{2}(\mathbf{X}\gamma)^{T}\mathbf{X}\gamma $$ $$=\text{RSS}(\beta)-\epsilon\big((\mathbf{y}-\mathbf{X}\beta)^{T}\mathbf{X}\gamma+(\mathbf{X}\gamma)^{T}(\mathbf{y}-\mathbf{X}\beta)\big)+\epsilon^{2}(\mathbf{X}\gamma)^{T}\mathbf{X}\gamma $$ So, $$\frac{\text{RSS}(\beta+\epsilon\gamma)-\text{RSS}(\beta)}{\epsilon\gamma^{T}} = \frac{-\big((\mathbf{y}-\mathbf{X}\beta)^{T}\mathbf{X}\gamma+(\mathbf{X}\gamma)^{T}(\mathbf{y}-\mathbf{X}\beta)\big)+\epsilon(\mathbf{X}\gamma)^{T}\mathbf{X}\gamma}{\gamma^{T}}. $$

The remaining $\epsilon$ term, then, does not survive the limit, and we are left with $$\frac{-\big(\gamma^T \mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta)+(\gamma^T \mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta))^T\big)}{\gamma^T} $$

However, both of these terms are just $1 \times 1$ matrices, a.k.a. scalars, so each is equal to its own transpose. The two terms therefore combine into $-2\gamma^T \mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta)$, the $\gamma^T$ 'divides' out, and we are left with $$\frac{\partial \text{RSS}}{\partial \beta} = -2 \mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta). $$
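The limit above is just a directional derivative, so the whole argument can be sanity-checked numerically: pick a random direction $\gamma$, form the difference quotient $(\text{RSS}(\beta+\epsilon\gamma)-\text{RSS}(\beta))/\epsilon$, and compare it with $\gamma^T$ times the claimed gradient. A small NumPy sketch, where the sizes, seed, and step size are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 5, 2
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
beta = rng.normal(size=p)
gamma = rng.normal(size=p)        # an arbitrary direction, as in the answer

def rss(b):
    r = y - X @ b
    return r @ r

# Difference quotient (RSS(beta + eps*gamma) - RSS(beta)) / eps
eps = 1e-7
quotient = (rss(beta + eps * gamma) - rss(beta)) / eps

# The closed form says this should approach gamma^T (-2 X^T (y - X beta))
grad = -2 * X.T @ (y - X @ beta)
print(quotient, gamma @ grad)
```

The two printed numbers agree up to an $O(\epsilon)$ error coming from the $\epsilon^2 (\mathbf{X}\gamma)^T\mathbf{X}\gamma$ term in the expansion.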