Matrix differentiation with respect to a vector in the least squares method


In the book The Elements of Statistical Learning, published by Springer, we can find the following statement:

We can write

$RSS(\beta) = (\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta)$

where $\mathbf{X}$ is an $N\times p$ matrix with each row an input vector, and $\mathbf{y}$ is an $N$-vector of the outputs in the training set. Differentiating w.r.t. $\beta$ we get the normal equations

$\mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta) = 0$

Questions

How do I formally derive the normal equations using matrix-level calculations only, without dropping down to individual scalar elements?

Is my second attempt valid?

First attempt

Note: $\beta$ is a $p$-vector. Let us treat vectors as column matrices.

As in The Matrix Cookbook (http://www.math.uwaterloo.ca/~hwolkowi//matrixcookbook.pdf), let us assume that $\partial\mathbf{X}^T = (\partial\mathbf{X})^T$ and $\partial(\mathbf{XY})=\partial(\mathbf{X})\mathbf{Y}+\mathbf{X}\partial(\mathbf{Y})$.

Let us differentiate with respect to $\beta$ and observe that $\partial (\mathbf{y}-\mathbf{X}\beta)=-\mathbf{X}$.

Now $\partial RSS(\beta)=(\partial (\mathbf{y}-\mathbf{X}\beta))^T(\mathbf{y}-\mathbf{X}\beta)+(\mathbf{y}-\mathbf{X}\beta)^T \partial (\mathbf{y}-\mathbf{X}\beta)$

Which gives us $\partial RSS(\beta)=-\mathbf{X}^T (\mathbf{y}-\mathbf{X}\beta)-(\mathbf{X}^T (\mathbf{y}-\mathbf{X}\beta))^T$

At this point we find a contradiction, because the dimensions are incompatible for summation: $\mathbf{X}^T (\mathbf{y}-\mathbf{X}\beta)$ is a $p\times 1$ column vector, while $(\mathbf{X}^T (\mathbf{y}-\mathbf{X}\beta))^T$ is a $1\times p$ row vector.

Second attempt

If I assumed $\partial(\mathbf{X}^T\mathbf{Y})=\partial(\mathbf{X})^T\mathbf{Y}+(\mathbf{X}^T\partial(\mathbf{Y}))^T$

I would get that $\partial RSS(\beta)=-2\mathbf{X}^T (\mathbf{y}-\mathbf{X}\beta)$, which matches the normal equations from the book.
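As a numerical sanity check of the second attempt, here is a small NumPy sketch (the data, shapes, and seed are made up for illustration) that compares $-2\mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta)$ against a central finite-difference approximation of the gradient of $RSS$:

```python
import numpy as np

# Made-up small problem: N = 5 samples, p = 3 coefficients.
rng = np.random.default_rng(0)
N, p = 5, 3
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)
beta = rng.standard_normal(p)

def rss(b):
    r = y - X @ b
    return r @ r  # (y - Xb)^T (y - Xb)

# Candidate gradient from the second attempt: -2 X^T (y - X beta).
grad_analytic = -2 * X.T @ (y - X @ beta)

# Central finite differences, one coordinate of beta at a time.
eps = 1e-6
grad_fd = np.array([
    (rss(beta + eps * e) - rss(beta - eps * e)) / (2 * eps)
    for e in np.eye(p)
])

print(np.allclose(grad_analytic, grad_fd, atol=1e-5))  # True
```

If the two agree, the formula from the second attempt is at least numerically consistent.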


Best answer

If you have not found a worked example yet, here is one.


Before we start deriving the gradient, here are some facts and notational conventions, for brevity:

  • Trace and Frobenius product relation $$\left\langle A, B C\right\rangle={\rm tr}(A^TBC) := A : B C$$
  • Cyclic properties of Trace/Frobenius product \begin{align} A : B C &= BC : A \\ &= B^T A : C \\ &= \text{etc.} \end{align}
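These identities are easy to check numerically. A minimal NumPy sketch (matrix shapes chosen arbitrarily for the check; not part of the original answer):

```python
import numpy as np

# Random conformable matrices; shapes are arbitrary.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 3))

def frob(U, V):
    return np.trace(U.T @ V)  # U : V := tr(U^T V)

lhs = frob(A, B @ C)                      # A : BC
print(np.isclose(lhs, frob(B @ C, A)))    # BC : A      -> True
print(np.isclose(lhs, frob(B.T @ A, C)))  # B^T A : C   -> True
```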

Let $f := \left\|y- X\beta \right\|^2 = \left(y- X\beta \right)^T \left(y- X\beta \right) = \left(y- X\beta\right):\left(y- X\beta\right)$.

Now, we can obtain the differential first, and then the gradient. \begin{align} df &= d\left( \left(y- X\beta\right):\left(y- X\beta\right) \right) \\ &= 2\left(y- X\beta \right) : d\left(y- X\beta\right) \\ &= 2\left(y- X\beta \right) : \left(-X \, d\beta\right) \\ &= -2X^T\left(y- X\beta\right) : d\beta \end{align}

Thus, the gradient is \begin{align} \frac{\partial}{\partial \beta} \left( \left\|y - X \beta \right\|^2 \right)= -2X^T\left(y- X\beta\right). \end{align}
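Setting this gradient to zero recovers the normal equations $X^T(y - X\beta) = 0$, i.e. $X^T X \beta = X^T y$. As a final check, a minimal NumPy sketch (random data; assumes $X$ has full column rank so that $X^TX$ is invertible) comparing the normal-equations solution with NumPy's built-in least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 100, 4
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)

# Solve X^T X beta = X^T y directly (valid when X has full column rank).
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from the library least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))  # True
```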