Help in understanding the gradient derivation


I am going through the following derivation of the gradient of the mean squared error for a simple linear regression problem. I am having trouble following along, as my vector calculus is a bit rusty, and was hoping someone could shed light on how the step where the gradient is applied was derived.

Given $\mathbf{X}\in\mathbb{R}^{m\times n}$, $\mathbf{w}\in\mathbb{R}^{n\times1}$ and $\mathbf{y}\in\mathbb{R}^{m\times1}$

$$\nabla_\mathbf{w}||\mathbf{X}\mathbf{w}-\mathbf{y}||_2^2=0$$

$$\implies \nabla_\mathbf{w}(\mathbf{X}\mathbf{w}-\mathbf{y})^\top(\mathbf{X}\mathbf{w}-\mathbf{y})=0$$

$$\implies \nabla_\mathbf{w}(\mathbf{w}^\top\mathbf{X}^\top-\mathbf{y}^\top)(\mathbf{X}\mathbf{w}-\mathbf{y})=0$$

$$\implies \nabla_\mathbf{w}(\mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}-\mathbf{w}^\top\mathbf{X}^\top\mathbf{y}-\mathbf{y}^\top\mathbf{X}\mathbf{w}+\mathbf{y}^\top\mathbf{y})=0$$

Since $\mathbf{a}^\top=\mathbf{a}$ if $\mathbf{a}$ is a scalar

$$\implies \nabla_\mathbf{w}(\mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}-2\mathbf{w}^\top\mathbf{X}^\top\mathbf{y}+\mathbf{y}^\top\mathbf{y})=0$$

$$\implies 2\mathbf{X}^\top\mathbf{X}\mathbf{w}-2\mathbf{X}^\top\mathbf{y}=0$$ This is the step where I get lost: how does applying the gradient yield that result?

My attempt at breaking that step down into understandable chunks is as follows: $$\nabla_\mathbf{w}\mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}-\nabla_\mathbf{w}2\mathbf{w}^\top\mathbf{X}^\top\mathbf{y}+\nabla_\mathbf{w}\mathbf{y}^\top\mathbf{y}$$

Since $\nabla_\mathbf{w}\mathbf{y}^\top\mathbf{y}=0$ $$\nabla_\mathbf{w}\mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}-2\nabla_\mathbf{w}\mathbf{w}^\top\mathbf{X}^\top\mathbf{y}$$

With $2\nabla_\mathbf{w}\mathbf{w}^\top\mathbf{X}^\top\mathbf{y}$ I can take the transpose, since $\mathbf{w}^\top\mathbf{X}^\top\mathbf{y}$ is a scalar, and then apply the gradient, but this yields $2\mathbf{y}^{\top}\mathbf{X}$?

$$\nabla_\mathbf{w}\mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}-2\nabla_\mathbf{w}\mathbf{y}^\top\mathbf{X}\mathbf{w}$$

$$\nabla_\mathbf{w}\mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}-2\mathbf{y}^\top\mathbf{X}$$

But now I am unsure, as I don't fully know how to apply the gradient to $\mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}$. Perhaps the dot product rule: if we set $\mathbf{A}=\mathbf{X}\mathbf{w}$, then $\frac{1}{2}\nabla(\mathbf{A}\cdot\mathbf{A})=(\nabla\mathbf{A})\cdot\mathbf{A}$? That would yield:

$$2\mathbf{X}\cdot\mathbf{Xw}-2\mathbf{y}^\top\mathbf{X}$$ $$2\mathbf{w}^\top\mathbf{X}^\top\mathbf{X}-2\mathbf{y}^\top\mathbf{X}$$ And then take the transpose?

$$(2\mathbf{w}^\top\mathbf{X}^\top\mathbf{X}-2\mathbf{y}^\top\mathbf{X})^\top=2\mathbf{X}^\top\mathbf{Xw}-2\mathbf{X}^\top\mathbf{y}$$

Which seems like the correct result, but now it is a vector in $\mathbb{R}^{n\times1}$, whereas I was expecting a scalar, since the $0$ on the right-hand side is a scalar in $\mathbb{R}$?
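For what it's worth, I did sanity check the claimed result numerically with a finite-difference comparison (a hypothetical NumPy sketch; the dimensions and random data are made up), and $2\mathbf{X}^\top\mathbf{X}\mathbf{w}-2\mathbf{X}^\top\mathbf{y}$ does match, so my confusion is only with the derivation itself:

```python
# Finite-difference check of grad f(w) = 2 X^T X w - 2 X^T y
# for f(w) = ||Xw - y||_2^2, on a small random problem.
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)
w = rng.standard_normal(n)

def f(w):
    r = X @ w - y
    return r @ r  # squared 2-norm of the residual

# Closed-form gradient from the derivation above.
grad_closed = 2 * X.T @ X @ w - 2 * X.T @ y

# Central differences along each coordinate direction.
eps = 1e-6
grad_fd = np.array([
    (f(w + eps * e) - f(w - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

print(np.allclose(grad_closed, grad_fd, atol=1e-5))  # prints True
```

Since $f$ is quadratic in $\mathbf{w}$, central differences are exact up to floating-point rounding, so the two gradients agree to high precision.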

Any help here would be greatly appreciated :) thanks

BEST ANSWER

It is usually easier to do the differentiation before doing any expansions.
In fact, if you define the vector $$v=Xw-y$$ then the objective function becomes $$\lambda = \|v\|^2 = v^\top v,$$ which is straightforward to differentiate: $$d\lambda = 2v^\top dv.$$ Substituting the original variables yields $$d\lambda = 2(Xw-y)^\top(X\:dw) \quad\implies\quad \nabla_w\lambda = 2(Xw-y)^\top X.$$ Depending on your preferred layout convention, you may wish to transpose this result.
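The "trivial to differentiate" step can be expanded term by term with the product rule for differentials (my expansion, using the convention $d\lambda = (\nabla_w\lambda)\,dw$ for a row-vector gradient; transpose at the end if you prefer a column vector):

$$\eqalign{ d\lambda &= d(v^\top v) = (dv)^\top v + v^\top dv = 2v^\top dv \\ dv &= d(Xw-y) = X\,dw \\ \implies d\lambda &= 2(Xw-y)^\top X\,dw \quad\implies\quad \nabla_w\lambda = 2(Xw-y)^\top X }$$

Here $(dv)^\top v = v^\top dv$ because each side is the same scalar, which is where the factor of $2$ comes from.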