How to take the gradient with respect to a vector


In *Deep Learning* by Goodfellow, Bengio, and Courville (adapted from page 108), where linear regression is explained as a machine learning algorithm, there is a passage solving the following expression:

To minimize $MSE$, we can simply solve for where its gradient is $0$: $$\nabla_{\mathbf w}MSE = 0$$

In addition, $\hat{\mathbf{y}}$ is defined as the prediction of the linear regression model, $\hat{\mathbf{y}} = \mathbf X \mathbf w$, where $\mathbf{X}$ is the matrix of inputs and $\mathbf{w}$ is the weight vector, while $\mathbf{y}$ is the vector of true output values, so that $MSE = \frac{1}{m}\lVert \hat{\mathbf{y}} - \mathbf{y}\rVert_2^2$ with $m$ the number of examples.

The solution follows this path:

$$\nabla_{\mathbf w}MSE = 0$$ $$\Rightarrow \nabla_{\mathbf w}\frac{1}{m}\lvert\lvert \hat {\mathbf{y}}-{\mathbf{y}}\rvert\rvert_2^2= 0$$ $$\Rightarrow \frac{1}{m} \nabla_{\mathbf w}\lvert\lvert {\mathbf{X}}\mathbf w -{\mathbf{y}}\rvert\rvert_2^2= 0$$ $$\Rightarrow \nabla_{\mathbf w} ( {\mathbf{X}}\mathbf w -{\mathbf{y}} )^{T} ( {\mathbf{X}}\mathbf w -{\mathbf{y}} ) = 0$$ $$\Rightarrow \nabla_{\mathbf w} ( \mathbf{w}^T{\mathbf{X}}^{T}{\mathbf{X}}\mathbf w - 2\mathbf{w}^T{\mathbf{X}}^{T}{\mathbf{y}} + {\mathbf{y}}^{T}{\mathbf{y}} ) = 0$$
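(To make the last expansion explicit: the constant factor $\frac{1}{m}$ can be dropped because the right-hand side is $0$, and the two cross terms merge because a scalar equals its own transpose:
$$\mathbf{y}^T\mathbf{X}\mathbf{w} = \left(\mathbf{y}^T\mathbf{X}\mathbf{w}\right)^T = \mathbf{w}^T\mathbf{X}^T\mathbf{y}$$
so $({\mathbf{X}}\mathbf w -{\mathbf{y}})^T({\mathbf{X}}\mathbf w -{\mathbf{y}}) = \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} - 2\,\mathbf{w}^T\mathbf{X}^T\mathbf{y} + \mathbf{y}^T\mathbf{y}$.)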

Now, the subsequent step is:

$$\Rightarrow ( 2{\mathbf{X}}^{T}{\mathbf{X}}\mathbf w - 2{\mathbf{X}}^{T}{\mathbf{y}}) = 0$$

I understand that this step takes the derivative with respect to the vector $\mathbf{w}$; however, I could not find the exact name for this kind of derivative, and consequently the rules needed to carry it out myself (in particular, how to deal with transposed vectors and matrices).


1 Answer


Using matrix transpose notation with vectors often confuses me, so I prefer to expand the norm using an explicit dot product instead:
$$\|z\|^2_2 = z\cdot z$$

In this form, finding the differential and the gradient of the norm is straightforward:
$$d\|z\|^2_2 = 2z\cdot dz \qquad\qquad \frac{\partial\|z\|^2_2}{\partial z} = 2z$$

Now repeat the calculation for $\,z=(X\cdot w-y)$, where $dz = X\cdot dw$:
$$\eqalign{
d\|z\|^2_2 &= 2z\cdot dz \cr
&= 2z\cdot (X\cdot dw) \cr
&= 2(X^T\cdot z)\cdot dw \cr\cr
\frac{\partial\|z\|^2_2}{\partial w} &= 2X^T\cdot z \cr
&= 2X^T\cdot(X\cdot w-y) \cr
}$$
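For completeness, the same result also follows from two standard matrix-calculus identities (with gradients laid out as column vectors):
$$\nabla_{\mathbf w}\left(\mathbf{w}^T A\,\mathbf{w}\right) = (A + A^T)\,\mathbf{w} \qquad\qquad \nabla_{\mathbf w}\left(\mathbf{b}^T\mathbf{w}\right) = \mathbf{b}$$
Applied with $A = \mathbf{X}^T\mathbf{X}$ (symmetric, so $(A+A^T)\,\mathbf{w} = 2\mathbf{X}^T\mathbf{X}\mathbf{w}$) and $\mathbf{b} = 2\mathbf{X}^T\mathbf{y}$, these reproduce the step from the book.

Here is also a minimal NumPy sketch (my own addition, not from the book or the answer above; all names and sizes are illustrative) that checks the closed-form gradient $2X^T(Xw - y)$ against a finite-difference approximation, and verifies that the gradient vanishes at the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3                        # arbitrary sizes: m examples, n features
X = rng.standard_normal((m, n))     # input matrix
y = rng.standard_normal(m)          # true outputs
w = rng.standard_normal(n)          # a candidate weight vector

def sq_norm(w):
    """||Xw - y||_2^2 (the 1/m factor is omitted; it only rescales the gradient)."""
    z = X @ w - y
    return z @ z

# Closed-form gradient derived above: 2 X^T (Xw - y)
grad_closed = 2 * X.T @ (X @ w - y)

# Central finite differences along each coordinate direction
eps = 1e-6
grad_fd = np.array([(sq_norm(w + eps * e) - sq_norm(w - eps * e)) / (2 * eps)
                    for e in np.eye(n)])

print(np.allclose(grad_closed, grad_fd, atol=1e-4))       # True

# Setting the gradient to zero gives the normal equations  X^T X w = X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(X.T @ (X @ w_star - y), 0, atol=1e-8))  # gradient ~ 0 at w*
```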