I am trying to derive the gradient of the least-squares loss function. If I have this (I am using ' to denote the transpose, as in MATLAB)
(y-Xw)'(y-Xw)
and I expand it
=(y'- w'X')(y-Xw)
=y'y -y'Xw -w'X'y + w'X'Xw
=y'y -y'Xw -y'Xw + w'X'Xw
=y'y -2y'Xw + w'X'Xw
Now I take the gradient with respect to w
=-2X'y + X'(Xw) + (w'X'X)'
=-2X'y + X'(Xw) + X'(Xw)
=-2X'y + 2X'(Xw)
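As a sanity check on that closed form (this snippet is mine, not part of the original post), we can compare $-2X'y + 2X'Xw$ against a finite-difference approximation of the loss on random data:

```python
import numpy as np

# Compare the closed-form gradient -2 X'y + 2 X'X w against
# central finite differences of the loss (y - Xw)'(y - Xw).
rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)
w = rng.standard_normal(n)

def loss(w):
    r = y - X @ w
    return r @ r  # scalar (y - Xw)'(y - Xw)

grad_closed = -2 * X.T @ y + 2 * X.T @ X @ w

eps = 1e-6
grad_fd = np.array([
    (loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
    for e in np.eye(n)  # perturb one coordinate of w at a time
])

print(np.allclose(grad_closed, grad_fd, atol=1e-5))
```

The two agree to finite-difference accuracy, so the expanded derivation above is consistent.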
And that is the intended result. Now, I saw in this post, Vector derivation of $x^Tx$, that the gradient of x'x is 2x. So I am trying to get the same result by applying that and the chain rule to the gradient of
=(y-Xw)'(y-Xw)
So I think this might be
=2(y-Xw)(-X)
=-2yX + 2XwX
The result is similar, but the transpositions are missing, so it does not work... What am I doing wrong? My mathematical background has almost disappeared and I have only just started to recover it, so please be patient if I did something terribly wrong...
Here is a piece of background information that we must be clear on at the beginning. If $F:\mathbb R^p \to \mathbb R^q$ is differentiable at a point $z$, then $F'(z)$ is a $q \times p$ matrix.
I assume $X$ is a real $m \times n$ matrix and $y$ is an $m \times 1$ column vector. Let $g:\mathbb R^m \to \mathbb R$ be defined by $g(u) = u^T u$. Note carefully that for any $u \in \mathbb R^m$, $g'(u) = 2 u^T$ is a $1 \times m$ matrix.
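To make the claim $g'(u) = 2u^T$ concrete, here is a small numerical check (my own illustration, not part of the answer): the finite-difference derivative of $g(u) = u^T u$ matches the row vector $2u^T$.

```python
import numpy as np

# Verify that the derivative of g(u) = u'u at u is the 1 x m row 2u',
# by central finite differences in each coordinate direction.
rng = np.random.default_rng(2)
m = 4
u = rng.standard_normal(m)

g = lambda v: v @ v  # u'u as a scalar
eps = 1e-6
row = np.array([
    (g(u + eps * e) - g(u - eps * e)) / (2 * eps)
    for e in np.eye(m)
])

print(np.allclose(row, 2 * u))
```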
Define $h:\mathbb R^n \to \mathbb R^m$ by $h(w) = y - X w$, and note that $h'(w) = - X$.
Now let $f:\mathbb R^n \to \mathbb R$ be defined by $$ f(w) = g(h(w)) = (y - X w)^T(y - X w). $$ The chain rule tells us that $$ \underbrace{f'(w)}_{1 \times n} = \underbrace{g'(h(w))}_{1 \times m} \underbrace{h'(w)}_{m \times n}. $$
With our particular choices of $g$ and $h$, we have $$ f'(w) = 2 (y - X w)^T(-X). $$ If we use the convention that the gradient is a column vector, then we have $$ \nabla f(w) = f'(w)^T = 2 X^T (X w - y). $$
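To close the loop on the question, a short check (again my own sketch, not from the answer) that the chain-rule form $2(y - Xw)^T(-X)$ and the gradient $2X^T(Xw - y)$ are numerically identical, i.e. the transposes were the only thing missing in the attempted derivation:

```python
import numpy as np

# The chain-rule row f'(w) = g'(h(w)) h'(w) = 2 (y - Xw)'(-X) and the
# column gradient 2 X'(Xw - y) should contain the same numbers.
rng = np.random.default_rng(1)
m, n = 6, 4
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)
w = rng.standard_normal(n)

f_prime = 2 * (y - X @ w) @ (-X)   # 1 x n row: g'(h(w)) h'(w)
grad    = 2 * X.T @ (X @ w - y)    # n x 1 column convention: f'(w)'

print(np.allclose(f_prime, grad))
```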