differentiation of a matrix function


In statistics, the residual sum of squares is given by the formula

$$ \operatorname{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta)$$

I know how to differentiate scalar functions, but how do I take the derivative of this with respect to $\beta$? By the way, I am trying to minimize RSS with respect to $\beta$, so I am setting the derivative equal to $0$.


I know the product rule has to apply somehow. Here is my first step:

$$-\mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta) + (\mathbf{y}-\mathbf{X}\beta)^T(-\mathbf{X})= 0$$

There are 3 solutions below.

BEST ANSWER

First, distribute the transpose over the first bracket, using $(A-B)^T=A^T-B^T$ and $(AB)^T=B^TA^T$:

$RSS=(y^T - \beta^T X^T)(y - X\beta)$

Multiplying out:

$RSS=y^Ty-\beta^T X^Ty-y^TX\beta+\beta^TX^T X\beta$

$\beta^T X^Ty$ and $y^TX\beta$ are equal: each is a scalar, and each is the transpose of the other. Thus

$RSS=y^Ty-2\beta^T X^Ty+\beta^TX^T X\beta$
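This expansion can be spot-checked numerically. A minimal sketch with NumPy (the data, shapes, and seed below are my own assumptions, not from the answer):

```python
# Check that (y - Xb)^T (y - Xb) equals y^T y - 2 b^T X^T y + b^T X^T X b
# on a small random example.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))   # 5 observations, 2 coefficients (arbitrary)
y = rng.standard_normal(5)
b = rng.standard_normal(2)

rss_direct = (y - X @ b) @ (y - X @ b)
rss_expanded = y @ y - 2 * b @ X.T @ y + b @ X.T @ X @ b

assert np.isclose(rss_direct, rss_expanded)
```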

Now you can differentiate with respect to $\beta$:

$\frac{\partial RSS}{\partial \beta}=-2X^Ty+2X^T X\beta=0$

Dividing by 2 and bringing the first summand to the RHS:

$X^T X\beta=X^Ty$

Multiplying both sides on the left by $(X^T X)^{-1}$ (this assumes $X^TX$ is invertible, i.e. the columns of $X$ are linearly independent):

$(X^T X)^{-1}X^T X\beta=(X^T X)^{-1}X^Ty$

$(X^T X)^{-1}X^T X= I$ (Identity matrix).

Finally you get $\beta=(X^T X)^{-1}X^Ty$
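As a sanity check, this closed form agrees with a standard least-squares solver. A minimal sketch with NumPy (the data and shapes are my own assumptions; the columns of $X$ are assumed linearly independent):

```python
# Compare beta = (X^T X)^{-1} X^T y with numpy's least-squares solver.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))  # tall matrix, full column rank (a.s.)
y = rng.standard_normal(20)

# Solve the normal equations X^T X beta = X^T y instead of forming an inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(beta_normal, beta_lstsq)
```

Using `np.linalg.solve` on the normal equations is numerically preferable to explicitly inverting $X^TX$.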

Equality of $\beta^T X^Ty$ and $y^TX\beta$

Here is an example:

$\left( \begin{array}{c c} b_1 & b_2 \end{array} \right) \cdot \left( \begin{array}{c c} x_{11} & x_{21} \\ x_{12} & x_{22}\end{array} \right) \cdot \left( \begin{array}{c} y_1 \\ y_2 \end{array} \right) $

$=\left( \begin{array}{c c} b_1x_{11}+b_2x_{12} & b_1x_{21}+b_2x_{22} \end{array} \right) \cdot \left( \begin{array}{c} y_1 \\ y_2 \end{array} \right)$

$=b_1 x_{11}y_1+b_2 x_{12}y_1+b_1x_{21}y_2+b_2x_{22}y_2\quad (\color{blue}{I})$


$\left( \begin{array}{c c} y_1 & y_2 \end{array} \right) \cdot \left( \begin{array}{c c} x_{11} & x_{12} \\ x_{21} & x_{22}\end{array} \right) \cdot \left( \begin{array}{c} b_1 \\ b_2 \end{array} \right) $

$=\left( \begin{array}{c c} y_1x_{11}+y_2x_{21} & y_1x_{12}+y_2x_{22} \end{array} \right) \cdot \left( \begin{array}{c} b_1 \\ b_2 \end{array} \right)$

$=y_1 x_{11}b_1+y_2 x_{21}b_1+y_1x_{12}b_2+y_2x_{22}b_2 \quad (\color{blue}{II})$

$\color{blue}{I}$ and $\color{blue}{II}$ are equal.
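The same scalar identity can be checked numerically. A minimal sketch with NumPy (random data of my own choosing):

```python
# Check that beta^T X^T y equals y^T X beta (both are the same scalar).
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((4, 2))
y = rng.standard_normal(4)
b = rng.standard_normal(2)

lhs = b @ X.T @ y   # beta^T X^T y
rhs = y @ X @ b     # y^T X beta

assert np.isclose(lhs, rhs)
```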

Derivative Rules

$\frac{\partial \beta ^T X ^T y }{\partial \beta }=X^Ty$

$\frac{\partial \beta^T X^T X \beta }{\partial \beta }=2X^TX\beta$
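These two rules combine to give the gradient $\partial RSS/\partial\beta = -2X^Ty + 2X^TX\beta$ used above; one can verify it against central finite differences. A minimal sketch with NumPy (data, shapes, and step size are my own assumptions):

```python
# Verify the gradient formula -2 X^T y + 2 X^T X b by finite differences.
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 3))
y = rng.standard_normal(8)
b = rng.standard_normal(3)

def rss(b):
    r = y - X @ b
    return r @ r

grad_formula = -2 * X.T @ y + 2 * X.T @ X @ b

# Central differences along each coordinate direction.
eps = 1e-6
grad_numeric = np.array([
    (rss(b + eps * e) - rss(b - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

assert np.allclose(grad_formula, grad_numeric, atol=1e-4)
```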

Another solution:

$X\beta$ is a vector with $i$-th entry (using summation notation): $$(X\beta)_i=X_i\cdot\beta=X_{ij}\beta_j,$$ so $(y-X\beta)_i=y_i-X_{ij}\beta_j,$ and thus $$(y-X\beta)^T(y-X\beta)=(y_i-X_{ij}\beta_j)^2.$$

Taking the derivative with respect to $\beta_k$ for each $k=1,\ldots,n$, we obtain: $$\partial_{\beta_k}RSS=2(-X_{ik})(y_i-X_{ij}\beta_j)=-2\sum_{i,j=1}^nX_{ik}(y_i-X_{ij}\beta_j)=0,\,\,\,\,k=1,\ldots,n,$$ i.e., $X^k\cdot y=X^k\cdot X\beta$ for $k=1,\ldots,n$, where $X^k$ is the $k$-th column of $X$. This is equivalent to $$X^Ty=X^T(X\beta),$$ so $\beta = (X^TX)^{-1}X^Ty$.

To determine what kind of critical point this is, take the derivative of $\partial_{\beta_k}RSS$ with respect to $\beta_l$: $$\partial^2_{\beta_k\beta_l}RSS=2\partial_{\beta_l}\sum_{i,j=1}^nX_{ik}X_{ij}\beta_j=2\sum_{i=1}^nX_{ik}X_{il}=2X^k\cdot X^l.$$ Whether we have a maximum, minimum, etc. depends on the nature of the matrix $X$: for a minimum we need the Hessian of $RSS$, namely $2X^TX$, to be positive definite.
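The positive-definiteness condition on the Hessian $2X^TX$ can be checked numerically. A minimal sketch with NumPy (random full-column-rank $X$ of my own choosing):

```python
# Check that the Hessian 2 X^T X of RSS is positive definite when X has
# linearly independent columns, so the critical point is a minimum.
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((10, 3))   # tall random matrix: full column rank a.s.

H = 2 * X.T @ X                    # Hessian of RSS (constant in beta)
eigvals = np.linalg.eigvalsh(H)    # eigenvalues of the symmetric matrix H

assert np.all(eigvals > 0)         # all positive -> positive definite
```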

Another solution:

I wouldn't do this by differentiating. Let $P$ be the matrix that projects orthogonally onto the column space of the design matrix $X$. Then $\mathbf y-P\mathbf y$ is orthogonal to $P\mathbf y - X\beta$, since $P\mathbf y$ and $X\beta$ are both in the column space of $X$. Then we have \begin{align} \operatorname{RSS}(\beta) & = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta) \\[8pt] & = ((\mathbf y - P\mathbf y)+(P\mathbf y - X\beta))^T((\mathbf y - P\mathbf y)+(P\mathbf y - X\beta)) \\[8pt] & = (\mathbf y-P\mathbf y)^T(\mathbf y-P\mathbf y) + (P\mathbf y-X\beta)^T(P\mathbf y-X\beta) \\ &\phantom{{}={}} + \underbrace{(\mathbf y-P\mathbf y)^T(P\mathbf y-X\beta) + (P\mathbf y-X\beta)^T(\mathbf y - P\mathbf y)}_{\text{This sum is zero, by orthogonality.}} \\[10pt] & = (\mathbf y-P\mathbf y)^T(\mathbf y-P\mathbf y) + (P\mathbf y-X\beta)^T(P\mathbf y-X\beta). \end{align} Now $\beta$ appears only in the second term. The second term can be minimized by making it $0$: since $P\mathbf y$ is in the column space of $X$, $\beta$ can be chosen so that $X\beta=P\mathbf y$.

The matrix $X$ typically has far more rows than columns. If the columns of $X$ are linearly independent, then $X$ has a left-inverse, namely $(X^TX)^{-1}X^T$. Invertibility of $X^TX$ follows from the linear independence of the columns of $X$.

Hence the equation $X\beta=P\mathbf y$ can be solved for $\beta$ by multiplying both sides on the left by that left-inverse.
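The projection argument can also be checked numerically: with $P = X(X^TX)^{-1}X^T$, the least-squares fit satisfies $X\beta = P\mathbf y$ and the residual is orthogonal to the column space. A minimal sketch with NumPy (random data of my own choosing; columns of $X$ assumed independent):

```python
# Verify that X beta = P y, where P projects onto the column space of X.
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((12, 3))
y = rng.standard_normal(12)

# P = X (X^T X)^{-1} X^T, built via a linear solve rather than an inverse.
P = X @ np.linalg.solve(X.T @ X, X.T)
beta = np.linalg.solve(X.T @ X, X.T @ y)   # the left-inverse applied to y

assert np.allclose(X @ beta, P @ y)                     # X beta = P y
assert np.allclose(X.T @ (y - P @ y), np.zeros(3))      # residual orthogonal to col(X)
```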