How Does $L_1$ Regularization Present Itself in Gradient Descent?


If we incorporate an $L_1$ regularization term into gradient descent, how does the update rule change? It's easy to write down the optimization objective, but I'm not sure what the corresponding update rule should be.


BEST ANSWER

The problem is that the gradient of the $L_1$ norm does not exist at $0$, so you need to be careful there. Write the regularized cost as

$$ E_{L_1} = E + \lambda\sum_{k=1}^N|\beta_k| $$

where $E$ is the unregularized cost function ($E$ stands for error), whose gradient I will assume you already know how to compute.

As for the regularization term, note that if $\beta_k > 0$ then $|\beta_k| = \beta_k$ and its derivative is $+1$; similarly, when $\beta_k < 0$ the derivative is $-1$. In summary,

$$ \frac{\partial |\beta_k|}{\partial \beta_l} = {\rm sgn}(\beta_k)\delta_{kl} $$

so that

$$ \frac{\partial E_{L_1}}{\partial \beta_l} = \frac{\partial E}{\partial \beta_l} + \lambda\sum_{k=1}^N {\rm sgn}(\beta_k)\delta_{kl} = \frac{\partial E}{\partial \beta_l} + \lambda {\rm sgn}(\beta_l) $$
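Plugging this into a plain gradient-descent step (with $\eta$ denoting the learning rate, a symbol not used above) gives the update rule the question asks about:

$$ \beta_l \leftarrow \beta_l - \eta\left(\frac{\partial E}{\partial \beta_l} + \lambda\,{\rm sgn}(\beta_l)\right) $$

At $\beta_l = 0$, where the derivative does not exist, a common convention is to take ${\rm sgn}(0) = 0$, which is a valid subgradient; proximal methods (soft-thresholding) handle that point more carefully.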
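As a minimal sketch of this update, assuming a generic parameter vector and a placeholder `grad_E` for the gradient of the unregularized cost (the names here are illustrative, not taken from the question):

```python
import numpy as np

def l1_gradient_step(beta, grad_E, lam, eta):
    """One (sub)gradient-descent step on E + lam * ||beta||_1."""
    # np.sign returns 0 at 0, i.e. the subgradient 0 is used there.
    return beta - eta * (grad_E + lam * np.sign(beta))

# Example usage with made-up numbers:
beta = np.array([0.5, -1.2, 0.0])
grad_E = np.array([0.1, -0.3, 0.2])   # gradient of E at beta (placeholder values)
beta = l1_gradient_step(beta, grad_E, lam=0.01, eta=0.1)
```

Note that this plain subgradient step rarely makes coefficients exactly zero; if exact sparsity is the goal, proximal updates (soft-thresholding) are usually preferred.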

ANSWER

It changes the direction in which you descend.

You may have a look at this PDF, Steepest Descent Direction for Various Norms; it shows the descent direction for a few different norms.
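To make that concrete (a standard result, not quoted from the PDF): the normalized steepest descent direction under the Euclidean norm is the negative gradient, while under the $\ell_\infty$ norm it is the negative sign vector of the gradient,

$$ \Delta x = -\frac{\nabla f(x)}{\|\nabla f(x)\|_2} \quad (\ell_2), \qquad \Delta x = -{\rm sgn}\big(\nabla f(x)\big) \quad (\ell_\infty), $$

and under the $\ell_1$ norm the step moves along the single coordinate whose partial derivative has the largest magnitude.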