Unclear about matrix calculus in least squares regression


The loss function of least squares regression is defined as (for example, in this question):

$L(w) = (y - Xw)^T (y - Xw) = (y^T - w^TX^T)(y - Xw)$

Taking the derivative of the loss w.r.t. the parameter vector $w$:

\begin{align} \frac{d L(w)}{d w} & = \frac{d}{dw} (y^T - w^TX^T)(y - Xw) \\ & = \frac{d}{dw} (y^Ty - y^TXw - w^TX^Ty + w^TX^TXw) \\ & = \frac{d}{dw} (y^Ty - y^TXw - (y^TXw)^T + w^TX^TXw) \end{align}

Since the second and third terms are scalars that are transposes of one another (hence equal), this simplifies to

\begin{align} & = \frac{d}{dw} (y^Ty - 2y^TXw + w^TX^TXw) \end{align}

My question is:

For the second term, shouldn't the derivative w.r.t. $w$ be $-2y^TX$?

And because $\frac{d}{dx}(x^TAx) = x^T(A^T + A)$ (see this question for an explanation),

shouldn't the derivative of the third term (which is also a scalar) be the following, applying that identity with $A = X^TX$? \begin{align} \frac{d}{dw} (w^TX^TXw) = w^T\left((X^TX)^T + X^TX\right) = 2 w^TX^TX \end{align}

From the above expressions, shouldn't the result of the derivative of the loss function be: $-2y^TX + 2 w^TX^TX$ ?

What I see in textbooks (including, for example, page 25 of these stanford.edu notes and page 10 of these harvard.edu notes) is a different expression: $-2X^Ty + 2 X^TXw$.

What am I missing here?
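As a sanity check, the two expressions give the same numbers; one is just the row-vector layout of the other. A minimal numpy sketch (shapes and data are made up for illustration):

```python
import numpy as np

# Arbitrary small problem, just to compare the two gradient expressions.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
y = rng.standard_normal(5)
w = rng.standard_normal(3)

row_form = -2 * y @ X + 2 * w @ X.T @ X    # -2 y^T X + 2 w^T X^T X  (row layout)
col_form = -2 * X.T @ y + 2 * X.T @ X @ w  # -2 X^T y + 2 X^T X w    (column layout)

# With 1-D numpy arrays the two layouts coincide element-wise.
print(np.allclose(row_form, col_form))
```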


There are 2 best solutions below


Let $z=Xw-y$. The loss function can then be expressed in terms of the Frobenius norm, or better yet the Frobenius product $A:B=\operatorname{tr}(A^TB)$, as $$L=\|z\|^2_F = z:z$$ The differential of this function is simply $$\eqalign{ dL &= 2\,z:dz \cr &= 2\,z:X\,dw \cr &= 2\,X^Tz:dw \cr }$$ Since $dL=\frac{\partial L}{\partial w}:dw,\,$ the gradient is $$\eqalign{ \frac{\partial L}{\partial w} &= 2\,X^Tz \cr &= 2\,X^T(Xw-y) \cr }$$ The advantage of this derivation is that it holds even if the vectors $\{w,y,z\}$ are replaced by rectangular matrices.
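The result $2X^Tz$ is easy to verify numerically. A small numpy sketch (arbitrary shapes and data) comparing it against a central finite-difference gradient of $L$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 4))
y = rng.standard_normal(6)
w = rng.standard_normal(4)

def L(w):
    z = X @ w - y
    return z @ z  # ||Xw - y||^2

grad = 2 * X.T @ (X @ w - y)

# Central finite differences along each coordinate direction.
eps = 1e-6
fd = np.array([(L(w + eps * e) - L(w - eps * e)) / (2 * eps) for e in np.eye(4)])
print(np.allclose(grad, fd, atol=1e-5))
```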

  1. Let $A=Xw-y$ and find the derivative map of the squared norm $L=\|A\|^{2}$: $D_A\|A\|^{2}(H)=\left.\frac{d}{dt}\right|_{0}\|A+tH\|^{2}=\left.\frac{d}{dt}\right|_{0}\langle A+tH,A+tH\rangle=2\langle A,H\rangle$

  2. Use the chain rule $D_{w}(L\circ A)=D_{A}L\circ D_{w}A$ as follows, $D_w\|A(w)\|^{2}(h)=\left.\frac{d}{dt}\right|_{0}\|A(w+th)\|^{2}=2\langle A,D_wA(h)\rangle$

  3. Apply this with $A(w)=Xw-y$: since $A$ is affine, $D_wA(h)=Xh$, giving $D_w\|A(w)\|^{2}(h)=2\langle Xw-y,Xh\rangle$

  4. Rewrite $2\langle Xw-y,Xh\rangle = 2\langle X^T(Xw-y),h\rangle$ and read off the gradient vector: $\nabla L(w) = 2X^T(Xw-y)$
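The inner-product manipulation in steps 3 and 4 can likewise be checked numerically: in this sketch (arbitrary data), the directional derivative $\left.\frac{d}{dt}\right|_{0}L(w+th)$ matches both $2\langle Xw-y,Xh\rangle$ and $2\langle X^T(Xw-y),h\rangle$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 3))
y = rng.standard_normal(5)
w = rng.standard_normal(3)
h = rng.standard_normal(3)  # an arbitrary direction vector

def L(w):
    z = X @ w - y
    return z @ z

# Directional derivative of L at w along h, by central differences.
t = 1e-6
directional_fd = (L(w + t * h) - L(w - t * h)) / (2 * t)

inner = 2 * (X @ w - y) @ (X @ h)         # 2 <Xw - y, Xh>   (step 3)
via_gradient = 2 * X.T @ (X @ w - y) @ h  # 2 <X^T(Xw - y), h>  (step 4)

print(np.allclose(directional_fd, inner), np.allclose(inner, via_gradient))
```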