Differentiating a matrix function


In the book "Elements of Statistical Learning", early on the author discusses linear regression, and naturally the residual sum of squares (RSS) as a function of the parameters $\boldsymbol{\beta}$. In the general formulation, $\text{RSS}(\beta) = (\boldsymbol{y} - \boldsymbol{X}\beta)^T(\boldsymbol{y} - \boldsymbol{X}\beta)$, where $\boldsymbol{X}$ is an $N \times p$ matrix and $\beta$ is a $p \times K$ matrix. The author then says that to minimize RSS, you differentiate with respect to $\beta$ and get $\boldsymbol{X}^T(\boldsymbol{y} - \boldsymbol{X}\beta) = 0$.

My question is: what are the mechanics of differentiating with respect to the matrix $\beta$? I have a B.S. in physics, so I have a reasonably sophisticated math background, but I never covered this in my undergraduate education. I tried looking a bit into "matrix calculus", but it wasn't much help. Is that the correct term? If this is the language used in the remainder of the textbook, what are some good resources for someone familiar with vector calculus and linear algebra to learn "matrix calculus"?


BEST ANSWER

It might help to expand $RSS(\beta)$:

$$ RSS(\beta) = \textbf{y}^T\textbf{y} - \textbf{y}^T\textbf{X}\beta - \beta^T\textbf{X}^T\textbf{y} + \beta^T\textbf{X}^T\textbf{X}\beta $$
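(Not part of the original answer, but the expansion is easy to sanity-check numerically; a quick NumPy sketch with arbitrary sizes:)

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 6, 3
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)
beta = rng.standard_normal(p)

# Compact form: (y - X beta)^T (y - X beta)
r = y - X @ beta
rss = r @ r

# Expanded form: y^T y - y^T X beta - beta^T X^T y + beta^T X^T X beta
expanded = y @ y - y @ X @ beta - beta @ X.T @ y + beta @ X.T @ X @ beta

assert np.allclose(rss, expanded)
```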

There are many well-known identities for differentiating with respect to a vector $\beta$. If you're looking for a resource, Wikipedia has an extensive list of them. The ones we're interested in are

$$ \frac{d\,\textbf{a}^T\textbf{x}}{d\textbf{x}} = \frac{d\,\textbf{x}^T\textbf{a}}{d\textbf{x}} = \textbf{a} \quad\text{and}\quad \frac{d\,\textbf{x}^T\textbf{A}\textbf{x}}{d\textbf{x}} = 2\textbf{A}\textbf{x}, $$

where $\textbf{A}$ is a symmetric matrix (not a function of $\textbf{x}$) and $\textbf{a}$ is a vector (not a function of $\textbf{x}$). Use the first identity on the two middle terms in $RSS(\beta)$ and the second identity on the last term:

$$ \frac{d}{d\beta}RSS(\beta) = -2\textbf{X}^T\textbf{y} + 2\textbf{X}^T\textbf{X}\beta = 0 $$
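(A numerical check, not part of the original answer: the analytic gradient above can be verified against a central finite-difference approximation of $RSS$; sizes here are arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 8, 3
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)
beta = rng.standard_normal(p)

def rss(b):
    """Residual sum of squares (y - X b)^T (y - X b)."""
    r = y - X @ b
    return r @ r

# Central finite-difference gradient of RSS at beta
h = 1e-6
grad_fd = np.array([
    (rss(beta + h * e) - rss(beta - h * e)) / (2 * h)
    for e in np.eye(p)
])

# Analytic gradient: -2 X^T y + 2 X^T X beta
grad_analytic = -2 * X.T @ y + 2 * X.T @ X @ beta

assert np.allclose(grad_fd, grad_analytic, atol=1e-4)
```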

Rearrange to get the normal equations. Since this is in the context of statistics, you might want to look at this question posted on Cross Validated. It gives many good references on matrix algebra in statistics.
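(Again an addition for illustration: rearranging the stationarity condition gives the normal equations $\textbf{X}^T\textbf{X}\beta = \textbf{X}^T\textbf{y}$, which can be solved directly. A NumPy sketch; in practice a least-squares routine such as `np.linalg.lstsq`, which uses a more numerically stable factorization, is preferable to forming $\textbf{X}^T\textbf{X}$ explicitly.)

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 50, 3
X = rng.standard_normal((N, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.01 * rng.standard_normal(N)

# Solve the normal equations X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent least-squares solve (more numerically stable)
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(beta_hat, beta_ls)
```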


I find this document (the Matrix Cookbook) helpful for these kinds of questions: http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf