Derivative of Least Squares Regression (numerator vs. denominator layout)


Assume I have the following expression: $$\frac{\partial}{\partial w} \lVert X^Tw - y \rVert^2 = 0$$

which is the condition for finding the least-squares solution in regression.

Let's assume $X \in \mathbb{R}^{D \times N}, w \in \mathbb{R}^{D \times 1}, y \in \mathbb{R}^{N \times 1}$. Thus, vectors are columns, and the matrix $X$ has the individual data-point vectors next to each other as its columns (I think $X$ is normally transposed in the literature).

I want to solve this equation using both the numerator and the denominator layout. My problem is that I'm not really confident about the implications of each layout, and in general I'm not good at working with matrix derivatives, which is part of the reason why I'm trying to derive this in a formally correct way.

If I'm not mistaken the layout should have two consequences:

  • The derivative of $X^Tw$ with respect to $w$. I'm not sure where to actually look this up. Is this rule number 69 in the Matrix Cookbook? Which layout is the cookbook using? I can't find it specified anywhere. Anyway, I figured it should be $X$ in the denominator case and $X^T$ in the numerator case.
  • The chain rule. In the denominator case, the inner derivative is multiplied from the left, whereas in the numerator case, the inner derivative is multiplied from the right.

It would be good to have confirmation on these. Are there other important differences in the rules when choosing one layout over the other?

For the denominator case I get a solution (I think): $$ \begin{align} \frac{\partial}{\partial w} \lVert X^Tw - y \rVert^2 &= 0 \\ 2X(X^Tw - y) &= 0 \\ XX^Tw - Xy &= 0 \\ XX^Tw &= Xy \\ w &= (XX^T)^{-1}Xy \\ \end{align} $$

The solution should be correct. Because $X$ is laid out differently here than in the literature, the transposes in the solution are switched relative to the usual formula, which is what I would expect.
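A quick way to sanity-check the denominator-layout result is to compare it against finite differences. The following is only a minimal NumPy sketch: the sizes `D = 3`, `N = 8`, the random data, and the names `loss`, `grad`, `fd`, `w_star` and `w_lstsq` are assumptions made up for illustration and are not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 8                        # assumed sizes, for illustration only
X = rng.standard_normal((D, N))    # columns of X are the data points
y = rng.standard_normal(N)
w = rng.standard_normal(D)

def loss(w):
    """Squared residual norm ||X^T w - y||^2."""
    r = X.T @ w - y
    return r @ r

# Gradient from the denominator-layout derivation: 2 X (X^T w - y)
grad = 2 * X @ (X.T @ w - y)

# Central finite differences along each coordinate direction of w
eps = 1e-6
fd = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
               for e in np.eye(D)])

# Closed-form solution w = (X X^T)^{-1} X y, computed without an explicit inverse
w_star = np.linalg.solve(X @ X.T, X @ y)

# Reference solution from NumPy's least-squares solver (design matrix is X^T)
w_lstsq = np.linalg.lstsq(X.T, y, rcond=None)[0]

print(np.allclose(grad, fd, atol=1e-5))
print(np.allclose(w_star, w_lstsq))
```

If both checks print `True`, the gradient $2X(X^Tw - y)$ and the closed form $(XX^T)^{-1}Xy$ are at least consistent with each other for this random instance, assuming $XX^T$ is invertible.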

For the numerator case, I don't get the solution though: $$ \begin{align} \frac{\partial}{\partial w} \lVert X^Tw - y \rVert^2 &= 0 \\ 2(X^Tw - y)X^T &= 0 \\ X^TwX^T - yX^T &= 0 \\ \end{align} $$ Neither of these products is defined. Do I need to take special care when using distributivity? Or what is the problem here? My assumption would be that the solution is the same for both the denominator and the numerator layout.

Thanks!
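One possible reading of the numerator-layout attempt, offered as a sketch rather than a definitive answer: in numerator layout the derivative of a scalar with respect to a column vector is a row vector, so the outer derivative of $\lVert r \rVert^2 = r^Tr$ with respect to $r$ would be $2r^T$ rather than $2r$. The symbol $r = X^Tw - y \in \mathbb{R}^{N \times 1}$ is introduced here only for illustration. Under that assumption the dimensions work out: $$ \begin{align} \frac{\partial}{\partial w} \lVert X^Tw - y \rVert^2 &= \frac{\partial (r^Tr)}{\partial r} \frac{\partial r}{\partial w} \\ &= 2r^T X^T \\ &= 2(X^Tw - y)^T X^T \in \mathbb{R}^{1 \times D} \end{align} $$ Setting this row vector to zero and transposing gives $X(X^Tw - y) = 0$, which is the same equation as in the denominator case and therefore leads to the same $w = (XX^T)^{-1}Xy$, assuming $XX^T$ is invertible.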