Minimizing RSS by taking partial derivative


I am learning about linear regression, and the goal is to find the parameters $\beta$ that minimize the RSS. My textbook accomplishes this by solving $\partial \text{RSS}/\partial \beta = 0$. However, I am slightly stuck on the following step:

They define:

$RSS(\beta) = (\mathbf{y} - \mathbf{X}\beta)^{T} (\mathbf{y}-\mathbf{X}\beta),$

where $\beta$ are scalars, $y$ is a column vector, and $X$ is a matrix.

They find that

$\frac{\partial RSS}{\partial \beta} = -2\mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta)$

I tried deriving this result. I first wrote: $(\mathbf{y} - \mathbf{X}\beta)^{T} (\mathbf{y}-\mathbf{X}\beta) = (\mathbf{y}^{T} - \mathbf{X}^{T}\beta)(\mathbf{y} - \mathbf{X}\beta)$

I then expanded the two terms in brackets: $\mathbf{y}^{T}\mathbf{y} - \mathbf{y}^{T}\mathbf{X}\beta - \mathbf{y}\mathbf{X}^{T}\beta + \mathbf{X}^{T}\mathbf{X}\beta^2$

Now, I differentiate this with respect to $\beta$: $-\mathbf{y}^{T}\mathbf{X} - \mathbf{y}\mathbf{X}^{T} + 2\beta \mathbf{X}^{T}\mathbf{X}$

This is where I get stuck. Comparing my result with the derived result, we both have the $2\beta \mathbf{X}^{T}\mathbf{X}$ term, but I don't see how my first two terms should simplify to give $-2\mathbf{X}^{T}\mathbf{y}$.

There are 4 answers below.

Accepted Answer

Note that $\beta$ is not a scalar, but a vector.

Let $$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$ $$\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np} \end{bmatrix}$$ and $$\beta = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_p \end{bmatrix}\text{.}$$ Then $\mathbf{X}\beta \in \mathbb{R}^N$ and $$\mathbf{X}\beta = \begin{bmatrix} \sum_{j=1}^{p}b_jx_{1j} \\ \sum_{j=1}^{p}b_jx_{2j} \\ \vdots \\ \sum_{j=1}^{p}b_jx_{Nj} \end{bmatrix} \implies \mathbf{y}-\mathbf{X}\beta=\begin{bmatrix} y_1 - \sum_{j=1}^{p}b_jx_{1j} \\ y_2 - \sum_{j=1}^{p}b_jx_{2j} \\ \vdots \\ y_N - \sum_{j=1}^{p}b_jx_{Nj} \end{bmatrix} \text{.}$$ Therefore, $$(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta) = \|\mathbf{y}-\mathbf{X}\beta \|^2 = \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)^2\text{.} $$ We have, for each $k = 1, \dots, p$, $$\dfrac{\partial \text{RSS}}{\partial b_k} = 2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)(-x_{ik}) = -2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{ik}\text{.}$$ Then $$\begin{align}\dfrac{\partial \text{RSS}}{\partial \beta} &= \begin{bmatrix} \dfrac{\partial \text{RSS}}{\partial b_1} \\ \dfrac{\partial \text{RSS}}{\partial b_2} \\ \vdots \\ \dfrac{\partial \text{RSS}}{\partial b_p} \end{bmatrix} \\ &= \begin{bmatrix} -2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i1} \\ -2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i2} \\ \vdots \\ -2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{ip} \end{bmatrix} \\ &= -2\begin{bmatrix} \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i1} \\ \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i2} \\ \vdots \\ \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{ip} \end{bmatrix} \\ &= -2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)\text{.} \end{align}$$
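The componentwise derivation above can be checked numerically. Here is a minimal NumPy sketch (the dimensions $N$, $p$ and the random data are arbitrary choices for illustration) comparing the closed-form gradient $-2\mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta)$ against central finite differences of the RSS:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 6, 3
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
beta = rng.normal(size=p)

def rss(b):
    # RSS(b) = (y - Xb)^T (y - Xb)
    r = y - X @ b
    return r @ r

# Closed-form gradient derived above: -2 X^T (y - X beta)
grad = -2 * X.T @ (y - X @ beta)

# Finite-difference approximation of each partial dRSS/db_k
eps = 1e-6
fd = np.array([(rss(beta + eps * e) - rss(beta - eps * e)) / (2 * eps)
               for e in np.eye(p)])

print(np.allclose(grad, fd, atol=1e-4))
```

Since RSS is quadratic in $\beta$, the central differences agree with the exact gradient up to floating-point error.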

Answer

The correct transpose (see property 3) is $(\mathbf{y} - \mathbf{X}\beta)^{T} (\mathbf{y}-\mathbf{X}\beta) = (\mathbf{y}^{T} - \beta^T\mathbf{X}^{T})(\mathbf{y} - \mathbf{X}\beta)$

The correct expansion is $\mathbf{y}^{T}\mathbf{y} - \mathbf{y}^{T}\mathbf{X}\beta - \beta^T \mathbf{X}^{T} \mathbf{y} + \beta^T\mathbf{X}^{T}\mathbf{X}\beta$

You can simplify the expansion to: $$\mathbf{y}^{T}\mathbf{y} + (-\mathbf{X}^{T} \mathbf{y})^T \beta + (-\mathbf{X}^{T} \mathbf{y})^T \beta + \beta^T\mathbf{X}^{T}\mathbf{X}\beta$$ And the result readily follows.
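The key step is that both cross terms are the same scalar: a $1\times 1$ matrix equals its own transpose, so $-\mathbf{y}^T\mathbf{X}\beta = -\beta^T\mathbf{X}^T\mathbf{y} = (-\mathbf{X}^T\mathbf{y})^T\beta$. A quick numerical sketch (random data chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 5, 2
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
beta = rng.normal(size=p)

# All three expressions are the same scalar:
t1 = -(y @ X @ beta)          # -y^T X beta
t2 = -(beta @ X.T @ y)        # -beta^T X^T y
t3 = (-(X.T @ y)) @ beta      # (-X^T y)^T beta

print(np.isclose(t1, t2), np.isclose(t1, t3))
```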

Answer

Expand the brackets to write $$ \begin{align} RSS(\beta)&=y'y-y'X\beta-\beta'X'y+\beta'X'X\beta\\ &=y'y-2\beta'X'y+\beta'X'X\beta \end{align} $$ where primes denote the transpose and $y'X\beta=\beta'X'y= (y'X\beta)'$ since $y'X\beta$ is a $1\times 1$ matrix (a scalar). Now we can differentiate to get that $$ \frac{\partial RSS(\beta)}{\partial \beta}=-2X'y+2X'X\beta=-2X'(y-X\beta). $$ Here we used two properties. First, if $u=\alpha'x$ where $\alpha,x\in\mathbb{R}^n$, then $$ \frac{\partial u}{\partial x_j}=\alpha_j\implies \frac{\partial u}{\partial x}=\alpha. $$ One should notice that $\frac{\partial u}{\partial x}$ in this case represents the gradient. Second, if $u=x'Ax=\sum_{i=1}^n\sum_{j=1}^na_{ij} x_{i} x_{j}$ where $A\in M_{n\times n}(\mathbb{R}) $ and $x\in\mathbb{R^n}$, then $$ \frac{\partial u}{\partial x_{\ell}}=\sum_{i=1}^na_{i\ell}x_{i}+\sum_{i=1}^na_{\ell i}x_{i}=[(A'+A)x]_{\ell} \implies \frac{\partial u}{\partial x}=(A'+A)x. $$ In particular, if $A$ is symmetric (like $X'X$ above), we have that $\frac{\partial u}{\partial x}=2Ax$.
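Both matrix-calculus identities used above can be verified numerically. A short NumPy sketch (the dimension $n$ and the random $\alpha$, $A$, $x$ are arbitrary illustrative choices) compares each identity against finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
a = rng.normal(size=n)          # the vector alpha
A = rng.normal(size=(n, n))     # a general (not necessarily symmetric) matrix
x = rng.normal(size=n)
eps = 1e-6

def fd_grad(f, x):
    # Central finite-difference gradient of a scalar function f at x
    return np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(len(x))])

# Identity 1: d(a'x)/dx = a
g1 = fd_grad(lambda v: a @ v, x)

# Identity 2: d(x'Ax)/dx = (A' + A) x
g2 = fd_grad(lambda v: v @ A @ v, x)

print(np.allclose(g1, a, atol=1e-5), np.allclose(g2, (A.T + A) @ x, atol=1e-4))
```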

Answer

Remark: $\beta$ is a vector.

In multiple regression, if you have $n$ independent variables, then you have $n+1$ parameters to estimate (including the intercept); that is: $$y_t=\beta_0+\beta_1X_{1t}+\dots+\beta_nX_{nt}+e_{t},$$ where each $\beta_{i}$ is a scalar. We can write the above in matrix notation (your problem is stated in matrix notation): $$y=X\beta+e,$$ where $X$ is a matrix, while $y$, $\beta$ and $e$ are vectors. More precisely, each $\beta_{i}$ is a scalar, but $\beta$ is a vector. Furthermore, the unique solution of the problem you mentioned is $$\beta=(X^{T}X)^{-1}X^{T}y,$$ from which you can easily see that $\beta$ is a vector.
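The closed-form solution can be checked numerically: at $\beta=(X^TX)^{-1}X^Ty$ the gradient $-2X^T(y-X\beta)$ vanishes, and the result matches NumPy's least-squares solver. A minimal sketch with arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 8, 3
X = rng.normal(size=(N, p))
y = rng.normal(size=N)

# Normal-equations solution: beta = (X'X)^{-1} X'y
# (solve is preferred over explicitly forming the inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# At the minimizer, the gradient -2 X'(y - X beta) vanishes
grad = -2 * X.T @ (y - X @ beta_hat)
print(np.allclose(grad, 0, atol=1e-8))

# Agrees with NumPy's built-in least-squares solver
print(np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0]))
```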