Derivative of Least Squares with L2 Norm


I'm new to matrix calculus, and I've never really taken derivatives of summations before. Could someone show me how I would get the first-order derivative of this?

$$J(w)=\frac{1}{2}\left[\sum_{i=1}^{m}\left(w^Tx^{(i)}-y^{(i)}\right)^2\right]+\lambda\|w\|_2^2$$

Thanks!

1 Answer

Our cost function is given by (writing the sum index as $n=1,\dots,N$ and using $\lambda\|\boldsymbol{w}\|_2^2=\lambda\boldsymbol{w}^T\boldsymbol{w}$)

$$J(\boldsymbol{w})=\dfrac{1}{2}\sum_{n=1}^{N}\left[\boldsymbol{w}^T\boldsymbol{x}_n-{y}_n \right]^2+\lambda\boldsymbol{w}^T\boldsymbol{w}$$ $$=\dfrac{1}{2}\sum_{n=1}^{N}\left[\boldsymbol{w}^T\boldsymbol{x}_n-{y}_n \right]^T\left[\boldsymbol{w}^T\boldsymbol{x}_n-{y}_n \right]+\lambda\boldsymbol{w}^T\boldsymbol{w}$$ $$=\dfrac{1}{2}\sum_{n=1}^{N}\left[\boldsymbol{x}^T_n\boldsymbol{w}^{}\boldsymbol{w}^T\boldsymbol{x}_n-y_n\boldsymbol{w}^T\boldsymbol{x}_n-\boldsymbol{x}^T_n\boldsymbol{w}y_n+y^2_n\right]+\lambda\boldsymbol{w}^T\boldsymbol{w}$$
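As a quick sanity check on this expansion (not part of the original question, just an illustration assuming NumPy and randomly generated data), both forms of $J(\boldsymbol{w})$ can be evaluated numerically and compared:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3                        # hypothetical problem size
X = rng.normal(size=(N, d))         # row n of X is x_n^T
y = rng.normal(size=N)
w = rng.normal(size=d)
lam = 0.1                           # regularization strength lambda

# Original form: (1/2) * sum_n (w^T x_n - y_n)^2 + lambda * w^T w
J1 = 0.5 * np.sum((X @ w - y) ** 2) + lam * (w @ w)

# Expanded form: (1/2) * sum_n [x_n^T w w^T x_n - 2 y_n w^T x_n + y_n^2] + lambda * w^T w
J2 = 0.5 * sum((x @ w) * (w @ x) - 2 * y_n * (w @ x) + y_n ** 2
               for x, y_n in zip(X, y)) + lam * (w @ w)

print(np.isclose(J1, J2))           # True: the two forms agree
```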

Now, treat $\boldsymbol{w}$ as if it were a scalar variable and calculate the total derivative (differential), applying the product rule term by term:

$$dJ = \dfrac{1}{2}\sum_{n=1}^{N}\left[\boldsymbol{x}^T_nd\boldsymbol{w}^{}\boldsymbol{w}^T\boldsymbol{x}_n+\boldsymbol{x}^T_n\boldsymbol{w}^{}d\boldsymbol{w}^T\boldsymbol{x}_n-y_nd\boldsymbol{w}^T\boldsymbol{x}_n-\boldsymbol{x}^T_nd\boldsymbol{w}y_n\right]+\lambda d\boldsymbol{w}^T\boldsymbol{w} + \lambda \boldsymbol{w}^Td\boldsymbol{w}$$ $$= \dfrac{1}{2}\sum_{n=1}^{N}\left[\boldsymbol{x}^T_nd\boldsymbol{w}^{}\boldsymbol{w}^T\boldsymbol{x}_n+\boldsymbol{x}^T_n\boldsymbol{w}^{}d\boldsymbol{w}^T\boldsymbol{x}_n-2y_nd\boldsymbol{w}^T\boldsymbol{x}_n\right]+2\lambda d\boldsymbol{w}^T\boldsymbol{w}.$$ I used the product rule for the total derivative. Note that the transpose of a scalar is the scalar itself; I used this fact to combine the two $y_n$-terms and the two regularizer terms. Now, we note that

$$\boldsymbol{x}^T_nd\boldsymbol{w}^{}\boldsymbol{w}^T\boldsymbol{x}_n=d\boldsymbol{w}^{T}\boldsymbol{x}_n\boldsymbol{x}^T_n\boldsymbol{w}^{}$$

and

$$\boldsymbol{x}^T_n\boldsymbol{w}^{}d\boldsymbol{w}^T\boldsymbol{x}_n=\boldsymbol{w}^T\boldsymbol{x}_n\boldsymbol{x}_n^Td\boldsymbol{w}^{}=d\boldsymbol{w}^{T}\boldsymbol{x}_n\boldsymbol{x}^T_n\boldsymbol{w}^{}$$ because both terms are scalars. Using these observations, we can rewrite the total derivative as

$$dJ = \dfrac{1}{2}\sum_{n=1}^{N}\left[2d\boldsymbol{w}^{T}\boldsymbol{x}_n\boldsymbol{x}^T_n\boldsymbol{w}^{}-2y_nd\boldsymbol{w}^T\boldsymbol{x}_n\right]+2\lambda d\boldsymbol{w}^T\boldsymbol{w}.$$

Factoring out $d\boldsymbol{w}^T$ results in

$$dJ =d\boldsymbol{w}^{T}\left[\dfrac{1}{2}\sum_{n=1}^{N}\left[2\boldsymbol{x}_n\boldsymbol{x}^T_n\boldsymbol{w}^{}-2y_n\boldsymbol{x}_n\right]+2\lambda \boldsymbol{w}\right].$$
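As a numerical check of the bracketed expression (again just an illustrative sketch, reusing the hypothetical `X`, `y`, `w`, `lam`, and `d` from the snippet above), it can be compared against a finite-difference approximation of $J$:

```python
def J(w):
    # Cost: (1/2) * sum_n (w^T x_n - y_n)^2 + lambda * w^T w
    return 0.5 * np.sum((X @ w - y) ** 2) + lam * (w @ w)

def grad_J(w):
    # Bracketed expression from the factored differential:
    # sum_n [x_n x_n^T w - y_n x_n] + 2 * lambda * w
    return X.T @ (X @ w - y) + 2 * lam * w

# Central finite differences along each coordinate direction
eps = 1e-6
num_grad = np.array([(J(w + eps * e) - J(w - eps * e)) / (2 * eps)
                     for e in np.eye(d)])

print(np.allclose(grad_J(w), num_grad))   # True (up to numerical error)
```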

The expression in brackets on the right-hand side is the gradient of $J$ with respect to $\boldsymbol{w}$, since the differential of a scalar function always has the form $dJ = d\boldsymbol{w}^T\,\nabla_{\boldsymbol{w}}J$. Setting the gradient to zero and solving for $\boldsymbol{w}$ results in an estimate $\boldsymbol{\hat{w}}$ for $\boldsymbol{w}$:

$$\boldsymbol{\hat{w}}=\left[\sum_{n=1}^{N}\boldsymbol{x}_n\boldsymbol{x}^T_n + 2\lambda\boldsymbol{I} \right]^{-1}\sum_{n=1}^{N}y_n\boldsymbol{x}_n.$$
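Continuing the same illustrative sketch, this closed-form estimate can be computed with `np.linalg.solve` and checked by confirming that the gradient vanishes there:

```python
# Closed-form estimate: (sum_n x_n x_n^T + 2 lambda I)^{-1} sum_n y_n x_n
# Note that sum_n x_n x_n^T = X^T X and sum_n y_n x_n = X^T y.
w_hat = np.linalg.solve(X.T @ X + 2 * lam * np.eye(d), X.T @ y)

print(np.allclose(grad_J(w_hat), 0))   # True: the gradient is zero at w_hat
```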