Where am I going wrong in solving $\frac{\partial}{\partial \mathbf w}(\mathbf y - \mathbf X\mathbf w)^T(\mathbf y - \mathbf X \mathbf w) = 0$?

I have the following equation which I wish to solve:

$$\frac{\partial}{\partial \mathbf w}(\mathbf y - \mathbf X\mathbf w)^T(\mathbf y - \mathbf X \mathbf w) = 0$$

Here $\mathbf y$ is $n \times 1$, $\mathbf X$ is $n \times 2$, and $\mathbf w$ is $2 \times 1$.

My solution (done on paper because MathJax is a bit difficult for me to use):

[Image of the handwritten derivation; not reproduced here.]

Also, is my reasoning for step 4 correct?

There are 3 best solutions below

2
On BEST ANSWER

Going from line $3$ to line $4$, note that $$ \frac{\partial}{\partial w} (y^TXw) = X^Ty. $$ With this correction you get the right answer $$ \hat{w} = (X^TX)^{-1}X^Ty. $$

Explicit derivation: Here the first column of $X$ is taken to be all ones (an intercept) and column $j+1$ holds the values $x_{ji}$, $i = 1,\dots,n$. Then $$ y^TXw = w_1\sum_{i=1}^ny_i + w_2\sum_{i=1}^ny_ix_{1i}+\cdots+w_p\sum_{i=1}^ny_ix_{p-1,i}. $$ Taking the derivative w.r.t. the vector $w \in \mathbb{R}^p$ gives the gradient, i.e., a vector with $p$ rows and $1$ column, namely $$ \begin{pmatrix} \sum y_i \\ \sum y_i x_{1i}\\ \vdots \\ \sum y_i x_{p-1,i} \end{pmatrix}, $$ where the $j$th row is the derivative of $y^TXw$ w.r.t. $w_j$. This vector is exactly $X^Ty$: since $X^T$ is $p\times n$ and $y$ is $n \times 1$, $X^Ty$ is $p \times 1$, as required.
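If it helps to see this numerically, here is a small sanity check of $\frac{\partial}{\partial w}(y^TXw) = X^Ty$ using central finite differences. This is only a sketch: the sizes, the random data, and the step size `eps` are arbitrary choices of mine and not from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2  # illustrative sizes (assumption)

# First column of ones, remaining columns random regressors (assumed data).
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)
w = rng.normal(size=p)

def f(w):
    """Scalar function y^T X w."""
    return y @ X @ w

# Central finite differences, one coordinate of w at a time.
eps = 1e-6
numeric_grad = np.array([
    (f(w + eps * e) - f(w - eps * e)) / (2 * eps)
    for e in np.eye(p)
])

analytic_grad = X.T @ y  # claimed gradient, shape (p,)

print(numeric_grad)
print(analytic_grad)
print(np.allclose(numeric_grad, analytic_grad, atol=1e-6))
```

Since $y^TXw$ is linear in $w$, the finite-difference estimate should agree with $X^Ty$ up to rounding error.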

5
On

No, your reasoning in step 4 is wrong. For example, if $\mathbf{X}$ is a square matrix, $\mathbf{X}^T \mathbf{X}$ is not a scalar, and in your case $\mathbf{X}$ is $n \times 2$, so $\mathbf{X}^T \mathbf{X}$ is a $2 \times 2$ matrix. Therefore your result is wrong. Do note that $$\frac{\partial}{\partial \mathbf{w}} \left(\mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w} \right) = 2 \mathbf{X}^T \mathbf{X} \mathbf{w}$$

I am sure that you can get to the right answer from here.
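For completeness, a sketch of the remaining steps, using the identity above together with $\frac{\partial}{\partial \mathbf{w}}(\mathbf{y}^T\mathbf{X}\mathbf{w}) = \mathbf{X}^T\mathbf{y}$ and the fact that $\mathbf{w}^T\mathbf{X}^T\mathbf{y} = \mathbf{y}^T\mathbf{X}\mathbf{w}$ because both are scalars:

\begin{align} (\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w}) &= \mathbf{y}^T\mathbf{y} - 2\,\mathbf{y}^T\mathbf{X}\mathbf{w} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}, \\ \frac{\partial}{\partial \mathbf{w}}(\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w}) &= -2\,\mathbf{X}^T\mathbf{y} + 2\,\mathbf{X}^T\mathbf{X}\mathbf{w} = 0 \\ \Longrightarrow \quad \mathbf{X}^T\mathbf{X}\mathbf{w} &= \mathbf{X}^T\mathbf{y}. \end{align}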

1
On

Let $M = \mathbf{y} - \mathbf{X} \mathbf{w}$

and $f = M^T M = M : M$.

We will use the following identities:

  • Trace and Frobenius product relation $$A:B={\rm tr}(A^TB)$$ or $$A^T:B={\rm tr}(AB)$$
  • Cyclic property of Trace/Frobenius product $$\eqalign{ A:BC &= AC^T:B \cr &= B^TA:C \cr &= \text{etc.} \cr }$$
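The two identities above can be checked numerically. The following is a minimal sketch; the helper `frob` and the matrix shapes are my own illustrative choices, picked only so that every product is well defined.

```python
import numpy as np

rng = np.random.default_rng(1)

def frob(A, B):
    """Frobenius product A : B = sum_ij A_ij * B_ij (same-shape matrices)."""
    return np.sum(A * B)

A = rng.normal(size=(4, 3))
A2 = rng.normal(size=(4, 3))  # same shape as A, for the trace relation
B = rng.normal(size=(4, 5))
C = rng.normal(size=(5, 3))   # so that B @ C has the same shape as A

# Trace relation: A : A2 = tr(A^T A2)
lhs_trace = frob(A, A2)
rhs_trace = np.trace(A.T @ A2)

# Cyclic rearrangements of A : BC
v1 = frob(A, B @ C)    # A : BC
v2 = frob(A @ C.T, B)  # AC^T : B
v3 = frob(B.T @ A, C)  # B^T A : C

print(np.isclose(lhs_trace, rhs_trace))
print(np.isclose(v1, v2), np.isclose(v1, v3))
```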

Now, we obtain the differential first and thereafter we obtain the gradient.

So, \begin{align} df &= \left( d M: M \right) + \left( M : dM \right)\\ &= 2M : dM \\ &= 2M : \left( - \mathbf{X} d \mathbf{w} \right) \\ &= - 2\mathbf{X}^T M \ : \ d \mathbf{w} \hspace{8mm} \text{note: utilized cyclic property of Frobenius product} \\ &= - 2\mathbf{X}^T \left( \mathbf{y} - \mathbf{X} \mathbf{w} \right) \ : \ d \mathbf{w} \ . \end{align}

Thus, the gradient reads \begin{align} \frac{\partial}{\partial \mathbf{w}} f = - 2\mathbf{X}^T \left( \mathbf{y} - \mathbf{X} \mathbf{w} \right) \ . \end{align}

Then you can set the gradient to $0$ and, provided $\mathbf{X}^T \mathbf{X}$ is invertible (i.e. $\mathbf{X}$ has full column rank), obtain $$\mathbf{w} = \left( \mathbf{X}^T \mathbf{X} \right) ^{-1} \mathbf{X}^T \mathbf{y}$$
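As a quick numerical illustration of this result, the sketch below solves $\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}$ on synthetic data, checks that the gradient vanishes at the solution, and compares against NumPy's least-squares solver. The data, the noise level, and `w_true` are made-up values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

# n x 2 design matrix as in the question; the column of ones is an assumption.
X = np.column_stack([np.ones(n), rng.normal(size=n)])
w_true = np.array([1.5, -0.7])             # hypothetical coefficients
y = X @ w_true + 0.1 * rng.normal(size=n)  # synthetic noisy observations

# Solve X^T X w = X^T y without forming the inverse explicitly.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient -2 X^T (y - X w) evaluated at w_hat; should be numerically zero.
grad = -2 * X.T @ (y - X @ w_hat)

# Reference solution from NumPy's least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_hat)
print(np.allclose(grad, 0, atol=1e-8))
print(np.allclose(w_hat, w_lstsq))
```

Using `np.linalg.solve` rather than an explicit inverse is just a common numerical habit; it computes the same $\mathbf{w}$ when $\mathbf{X}^T\mathbf{X}$ is invertible.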