Help in understanding the gradient derivation


I am going through the following derivation of the gradient of the mean squared error for a simple linear regression problem. I am having trouble following along, as my vector calculus is a bit rusty, and was hoping someone could shed light on how the step where the gradient is applied was derived.

Given $\mathbf{X}\in\mathbb{R}^{m\times n}$, $\mathbf{w}\in\mathbb{R}^{n\times1}$ and $\mathbf{y}\in\mathbb{R}^{m\times1}$

$$\nabla_\mathbf{w}||\mathbf{X}\mathbf{w}-\mathbf{y}||_2^2=0$$

$$\implies \nabla_\mathbf{w}(\mathbf{X}\mathbf{w}-\mathbf{y})^\top(\mathbf{X}\mathbf{w}-\mathbf{y})=0$$

$$\implies \nabla_\mathbf{w}(\mathbf{w}^\top\mathbf{X}^\top-\mathbf{y}^\top)(\mathbf{X}\mathbf{w}-\mathbf{y})=0$$

$$\implies \nabla_\mathbf{w}(\mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}-\mathbf{w}^\top\mathbf{X}^\top\mathbf{y}-\mathbf{y}^\top\mathbf{X}\mathbf{w}+\mathbf{y}^\top\mathbf{y})=0$$

Since $\mathbf{a}^\top=\mathbf{a}$ if $\mathbf{a}$ is a scalar

$$\implies \nabla_\mathbf{w}(\mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}-2\mathbf{w}^\top\mathbf{X}^\top\mathbf{y}+\mathbf{y}^\top\mathbf{y})=0$$

$$\implies 2\mathbf{X}^\top\mathbf{X}\mathbf{w}-2\mathbf{X}^\top\mathbf{y}=0$$ This is the step where I get lost: how does applying the gradient yield that result?

My attempt at breaking that step down into understandable chunks is as follows: $$\nabla_\mathbf{w}\mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}-\nabla_\mathbf{w}2\mathbf{w}^\top\mathbf{X}^\top\mathbf{y}+\nabla_\mathbf{w}\mathbf{y}^\top\mathbf{y}$$

Since $\nabla_\mathbf{w}\mathbf{y}^\top\mathbf{y}=0$ $$\nabla_\mathbf{w}\mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}-2\nabla_\mathbf{w}\mathbf{w}^\top\mathbf{X}^\top\mathbf{y}$$

With $2\nabla_\mathbf{w}\mathbf{w}^\top\mathbf{X}^\top\mathbf{y}$ I can take the transpose, since $\mathbf{w}^\top\mathbf{X}^\top\mathbf{y}$ is a scalar, and then apply the gradient, but this yields $2\mathbf{y}^{\top}\mathbf{X}$?

$$\nabla_\mathbf{w}\mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}-2\nabla_\mathbf{w}\mathbf{y}^\top\mathbf{X}\mathbf{w}$$

$$\nabla_\mathbf{w}\mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}-2\mathbf{y}^\top\mathbf{X}$$

But now I am unsure, as I don't fully know how to apply the gradient to $\mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}$. Perhaps the dot product rule: if we set $\mathbf{A}=\mathbf{X}\mathbf{w}$, then $\frac{1}{2}\nabla(\mathbf{A}\cdot\mathbf{A})=(\nabla\mathbf{A})\cdot\mathbf{A}$? That would yield:

$$2\mathbf{X}\cdot\mathbf{Xw}-2\mathbf{y}^\top\mathbf{X}$$ $$2\mathbf{w}^\top\mathbf{X}^\top\mathbf{X}-2\mathbf{y}^\top\mathbf{X}$$ And then take the transpose?

$$(2\mathbf{w}^\top\mathbf{X}^\top\mathbf{X}-2\mathbf{y}^\top\mathbf{X})^\top=2\mathbf{X}^\top\mathbf{Xw}-2\mathbf{X}^\top\mathbf{y}$$

Which seems like the correct result, but now it is a vector in $\mathbb{R}^{n\times1}$, whereas I was expecting a scalar, since the $0$ on the right-hand side is a scalar in $\mathbb{R}$?
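For what it's worth, I did sanity check the claimed result numerically with a finite-difference comparison (a hypothetical NumPy sketch; the dimensions and random data are made up), and $2\mathbf{X}^\top\mathbf{X}\mathbf{w}-2\mathbf{X}^\top\mathbf{y}$ does match, so my confusion is only with the derivation itself:

```python
# Finite-difference check of grad f(w) = 2 X^T X w - 2 X^T y
# for f(w) = ||Xw - y||_2^2, on a small random problem.
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)
w = rng.standard_normal(n)

def f(w):
    r = X @ w - y
    return r @ r  # squared 2-norm of the residual

# Closed-form gradient from the derivation above.
grad_closed = 2 * X.T @ X @ w - 2 * X.T @ y

# Central differences along each coordinate direction.
eps = 1e-6
grad_fd = np.array([
    (f(w + eps * e) - f(w - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

print(np.allclose(grad_closed, grad_fd, atol=1e-5))  # prints True
```

Since $f$ is quadratic in $\mathbf{w}$, central differences are exact up to floating-point rounding, so the two gradients agree to high precision.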

Any help here would be greatly appreciated :) thanks

BEST ANSWER

It is usually easier to do the differentiation before doing any expansions.
In fact, if you define the vector $$v=Xw-y$$ then the objective function becomes $$\lambda = \|v\|^2 = v^\top v,$$ which is straightforward to differentiate: $$d\lambda = 2v^\top dv.$$ Substituting the original variables yields $$d\lambda = 2(Xw-y)^\top(X\:dw) \quad\implies\quad \nabla_w\lambda = 2(Xw-y)^\top X.$$ Depending on your preferred layout convention, you may wish to transpose this result.
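The "trivial to differentiate" step can be expanded term by term with the product rule for differentials (my expansion, using the convention $d\lambda = (\nabla_w\lambda)\,dw$ for a row-vector gradient; transpose at the end if you prefer a column vector):

$$\eqalign{ d\lambda &= d(v^\top v) = (dv)^\top v + v^\top dv = 2v^\top dv \\ dv &= d(Xw-y) = X\,dw \\ \implies d\lambda &= 2(Xw-y)^\top X\,dw \quad\implies\quad \nabla_w\lambda = 2(Xw-y)^\top X }$$

Here $(dv)^\top v = v^\top dv$ because each side is the same scalar, which is where the factor of $2$ comes from.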