I am learning about linear regression, and the goal is to find the parameters $\beta$ that minimize the RSS. My textbook accomplishes this by solving $\partial \text{RSS}/\partial \beta = 0$. However, I am slightly stuck on the following step:
They define:
$RSS(\beta) = (\mathbf{y} - \mathbf{X}\beta)^{T} (\mathbf{y}-\mathbf{X}\beta)$,
where $\beta$ are scalars, $\mathbf{y}$ is a column vector, and $\mathbf{X}$ is a matrix.
They find that
$\frac{\partial RSS}{\partial \beta} = -2\mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta)$
I tried deriving this result. I first wrote: $(\mathbf{y} - \mathbf{X}\beta)^{T} (\mathbf{y}-\mathbf{X}\beta) = (\mathbf{y}^{T} - \mathbf{X}^{T}\beta)(\mathbf{y} - \mathbf{X}\beta)$
I then expanded the two terms in brackets: $\mathbf{y}^{T}\mathbf{y} - \mathbf{y}^{T}\mathbf{X}\beta - \mathbf{y}\mathbf{X}^{T}\beta + \mathbf{X}^{T}\mathbf{X}\beta^2$
Now, I differentiate this with respect to $\beta$: $-\mathbf{y}^{T}\mathbf{X} - \mathbf{y}\mathbf{X}^{T} + 2\beta \mathbf{X}^{T}\mathbf{X}$
This is where I get stuck: comparing my result with the derived result, we both have the $2\beta \mathbf{X}^{T}\mathbf{X}$ term, but I don't see how my first two terms should simplify to give $-2\mathbf{X}^{T}\mathbf{y}$.
Note that $\beta$ is not a scalar, but a vector. That is why your very first step already breaks down: the transpose of a product reverses the order, so $(\mathbf{y} - \mathbf{X}\beta)^{T} = \mathbf{y}^{T} - \beta^{T}\mathbf{X}^{T}$, not $\mathbf{y}^{T} - \mathbf{X}^{T}\beta$, and expressions like $\beta^{2}$ are not even defined for a vector.
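For comparison, here is the corrected matrix-form version of your expansion. The two cross terms $\mathbf{y}^{T}\mathbf{X}\beta$ and $\beta^{T}\mathbf{X}^{T}\mathbf{y}$ are scalars and transposes of each other, hence equal, so
$$RSS(\beta) = \mathbf{y}^{T}\mathbf{y} - 2\beta^{T}\mathbf{X}^{T}\mathbf{y} + \beta^{T}\mathbf{X}^{T}\mathbf{X}\beta\text{.}$$
Using the standard identities $\partial(\mathbf{a}^{T}\beta)/\partial \beta = \mathbf{a}$ and $\partial(\beta^{T}\mathbf{A}\beta)/\partial \beta = (\mathbf{A}+\mathbf{A}^{T})\beta$ (take these on faith for a moment; the componentwise derivation below proves everything from scratch), and noting that $\mathbf{X}^{T}\mathbf{X}$ is symmetric,
$$\frac{\partial RSS}{\partial \beta} = -2\mathbf{X}^{T}\mathbf{y} + 2\mathbf{X}^{T}\mathbf{X}\beta = -2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)\text{.}$$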
Let
$$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \qquad \mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np} \end{bmatrix}, \qquad \beta = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_p \end{bmatrix}\text{.}$$

Then $\mathbf{X}\beta \in \mathbb{R}^N$ and
$$\mathbf{X}\beta = \begin{bmatrix} \sum_{j=1}^{p}b_jx_{1j} \\ \sum_{j=1}^{p}b_jx_{2j} \\ \vdots \\ \sum_{j=1}^{p}b_jx_{Nj} \end{bmatrix} \implies \mathbf{y}-\mathbf{X}\beta=\begin{bmatrix} y_1 - \sum_{j=1}^{p}b_jx_{1j} \\ y_2 - \sum_{j=1}^{p}b_jx_{2j} \\ \vdots \\ y_N - \sum_{j=1}^{p}b_jx_{Nj} \end{bmatrix} \text{.}$$

Therefore,
$$(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta) = \|\mathbf{y}-\mathbf{X}\beta \|^2 = \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)^2\text{.}$$

We have, for each $k = 1, \dots, p$, by the chain rule,
$$\dfrac{\partial \text{RSS}}{\partial b_k} = 2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)(-x_{ik}) = -2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{ik}\text{.}$$

Stacking these partials into a column vector and factoring out the $-2$,
$$\begin{align}\dfrac{\partial \text{RSS}}{\partial \beta} &= \begin{bmatrix} \dfrac{\partial \text{RSS}}{\partial b_1} \\ \dfrac{\partial \text{RSS}}{\partial b_2} \\ \vdots \\ \dfrac{\partial \text{RSS}}{\partial b_p} \end{bmatrix} \\ &= \begin{bmatrix} -2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i1} \\ -2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i2} \\ \vdots \\ -2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{ip} \end{bmatrix} \\ &= -2\begin{bmatrix} \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i1} \\ \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i2} \\ \vdots \\ \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{ip} \end{bmatrix} \\ &= -2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)\text{.} \end{align}$$
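If you want a quick numerical sanity check, here is a minimal NumPy sketch (the sizes $N = 50$, $p = 3$ and the random data are arbitrary choices of mine, not from any textbook) comparing the closed-form gradient $-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)$ with a central finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = rng.standard_normal((N, p))   # design matrix
y = rng.standard_normal(N)        # response vector
beta = rng.standard_normal(p)     # point at which to evaluate the gradient

def rss(b):
    """Residual sum of squares (y - Xb)^T (y - Xb)."""
    r = y - X @ b
    return r @ r

# Closed-form gradient derived above: -2 X^T (y - X beta)
grad_closed = -2 * X.T @ (y - X @ beta)

# Central finite differences, one coordinate b_k at a time
eps = 1e-6
grad_fd = np.empty(p)
for k in range(p):
    e = np.zeros(p)
    e[k] = eps
    grad_fd[k] = (rss(beta + e) - rss(beta - e)) / (2 * eps)

print(np.allclose(grad_closed, grad_fd, atol=1e-4))  # True
```

Because RSS is quadratic in $\beta$, the central difference has no truncation error here, so the two gradients agree up to floating-point rounding.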