When to use L2 regularization?


We know that L1 and L2 regularization are solutions to avoid overfitting.

L1 regularization can lead to sparsity, and therefore avoids fitting to the noise; L2 regularization does not produce sparse solutions.

So I wonder when there is a need to use L2 regularization?

The sparsity assumption may not apply to every problem; in fact, there are many problems where it does not. In those cases, an $L^1$ regularization technique has little effect on the efficiency of the linear model. $L^2$ regularization, on the other hand, often gives better performance when using the model, while being simple to compute and coming with some useful theoretical results.

Let $x \in \mathbb{C}^n$, $y \in \mathbb{C}^m$ and $A \in \mathcal{M}_{m \times n}( \mathbb{C})$.

The usual least-squares problem is to find $x^*$ such that:

$$\|Ax^*-y\|_2^2=\min_x \|Ax-y\|_2^2$$

$\|\cdot\|_2$ being the Euclidean norm.
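As a concrete sketch, this least-squares problem can be solved directly in NumPy (the matrix $A$ and vector $y$ below are made-up illustrative data):

```python
import numpy as np

# Illustrative overdetermined system: 4 equations, 2 unknowns (made-up data).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([0.1, 0.9, 2.1, 2.9])

# x_star minimizes ||A x - y||_2^2.
x_star, residuals, rank, sing_vals = np.linalg.lstsq(A, y, rcond=None)
```

At the optimum, the residual $Ax^*-y$ is orthogonal to the columns of $A$ (the normal equations $A^*(Ax^*-y)=0$).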

That problem can be solved analytically by taking the pseudo-inverse of $A$, denoted $A^+$ here. We know that $x^* = A^+y$. The pseudo-inverse can be computed through the Singular Value Decomposition (SVD). The SVD of $A$ is given by:

$$A=USV^*$$

with $U$ and $V$ unitary matrices and $S$ a rectangular matrix whose only possibly non-zero entries lie on the diagonal. These entries are non-negative real numbers, called the singular values of $A$.

Then $A^+=VS^+U^*$, where $S^+$ is also diagonal (and rectangular); its diagonal coefficients are the inverses of the non-zero singular values of $A$ (the zero singular values stay zero).
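A minimal NumPy sketch of this construction (the tolerance used to decide what counts as "non-zero" is an arbitrary choice here):

```python
import numpy as np

def pinv_via_svd(A):
    """Pseudo-inverse A^+ = V S^+ U^* built from the SVD A = U S V^*."""
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    # Invert only the singular values that are not (numerically) zero.
    s_inv = np.zeros_like(s)
    nonzero = s > 1e-12 * s.max()
    s_inv[nonzero] = 1.0 / s[nonzero]
    return Vh.conj().T @ (s_inv[:, None] * U.conj().T)

# Rank-deficient example: only one non-zero singular value.
A = np.array([[2.0, 0.0],
              [0.0, 0.0],
              [0.0, 0.0]])
```

On this example the zero singular value is simply skipped, which is exactly what $S^+$ prescribes.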

Suppose one of the singular values is very small relative to the others; taking its inverse then makes one of the coefficients of $S^+$ very large. The problem is that small singular values are often associated with noise. With real-world data, which is always noisy, applying the pseudo-inverse may therefore amplify that noise. $L^2$ regularization is one way to prevent this phenomenon.
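To see the amplification concretely, here is a small constructed example (the sizes, singular values, and noise level are all made up): a matrix with singular values $1$ and $10^{-6}$, hit with noise of size $10^{-4}$ aligned with the weak direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build A = U S V^* with singular values 1 and 1e-6 (made-up size: 5x2).
U, _ = np.linalg.qr(rng.standard_normal((5, 2)))
V, _ = np.linalg.qr(rng.standard_normal((2, 2)))
A = U @ np.diag([1.0, 1e-6]) @ V.T

x_true = np.array([1.0, 1.0])
y_clean = A @ x_true
noise = 1e-4 * U[:, 1]          # noise along the small singular direction

x_clean = np.linalg.pinv(A) @ y_clean
x_noisy = np.linalg.pinv(A) @ (y_clean + noise)
# The tiny singular value's inverse (1e6) turns 1e-4 noise into an error of size 100.
```

With clean data the pseudo-inverse recovers $x$ exactly; with noise, the $10^{-6}$ singular value's inverse multiplies the perturbation by $10^6$.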

The $L^2$ regularization problem can be written this way:

$$\min_x \|Ax-y\|_2^2+ \| \Gamma x\|_2^2,$$

with $\Gamma$ an $n \times n$ complex matrix (the Tikhonov matrix).

If $A^*A+\Gamma^*\Gamma$ is invertible (which is typically the case for an overdetermined problem), the solution is given by:

$$\bar x = (A^*A+\Gamma^*\Gamma)^{-1}A^*y.$$

The singular values of $A$ are the square roots of the eigenvalues of $A^*A$ (which is Hermitian and positive semi-definite, so it can be diagonalized with real non-negative eigenvalues).

Suppose, for simplicity, that $\Gamma = \sqrt{a}\, I$ with $I$ the identity matrix and $a$ a positive coefficient, so that $\Gamma^*\Gamma = aI$ (this is the choice commonly made when using $L^2$ regularization, i.e. ridge regression). The eigenvalues of $A^*A+aI$ are equal to the eigenvalues of $A^*A$ plus $a$. In other words:

$$A^*A+aI = Q (D+aI)Q^*,$$

with $D$ a diagonal matrix made from the squares of the singular values of $A$ and $Q$ a unitary matrix.

Then there is no longer any problem with inverting possibly very small singular values, since the quantity $a$ is added to each squared singular value (for a reasonable choice of $a$). If the problem is underdetermined, the reasoning stays the same, but it is cheaper to use the equivalent form $\bar x = A^*(AA^*+aI)^{-1}y$, which only requires inverting an $m \times m$ matrix.
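A sketch of the regularized solve on the same kind of ill-conditioned, made-up example (the value of $a$ is an arbitrary illustrative choice):

```python
import numpy as np

def ridge_solve(A, y, a):
    """Solve min_x ||A x - y||^2 + a ||x||^2 via (A^* A + a I) x = A^* y."""
    n = A.shape[1]
    return np.linalg.solve(A.conj().T @ A + a * np.eye(n), A.conj().T @ y)

# Ill-conditioned example: singular values 1 and 1e-6, noise on the weak direction.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((5, 2)))
V, _ = np.linalg.qr(rng.standard_normal((2, 2)))
A = U @ np.diag([1.0, 1e-6]) @ V.T
x_true = np.array([1.0, 1.0])
y = A @ x_true + 1e-4 * U[:, 1]

x_ridge = ridge_solve(A, y, a=1e-4)
x_plain = np.linalg.pinv(A) @ y
```

The plain pseudo-inverse amplifies the $10^{-4}$ noise into an error of size $100$, while the regularized solution stays bounded, at the price of a small bias along the weak direction.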

Note that a similar way to solve a least-squares problem while avoiding noise amplification is the TSVD (truncated SVD): compute the SVD and, before forming the pseudo-inverse, set to $0$ every singular value smaller than some threshold.
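A sketch of TSVD on the same style of constructed example (the threshold value is an arbitrary illustrative choice):

```python
import numpy as np

def tsvd_solve(A, y, threshold):
    """Least squares via truncated SVD: singular values below `threshold`
    are treated as noise and discarded before inverting."""
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    s_inv = np.zeros_like(s)
    keep = s >= threshold
    s_inv[keep] = 1.0 / s[keep]
    return Vh.conj().T @ (s_inv * (U.conj().T @ y))

# Singular values 1 and 1e-6; noise of size 1e-4 on the weak direction.
rng = np.random.default_rng(0)
U0, _ = np.linalg.qr(rng.standard_normal((5, 2)))
V0, _ = np.linalg.qr(rng.standard_normal((2, 2)))
A = U0 @ np.diag([1.0, 1e-6]) @ V0.T
x_true = np.array([1.0, 1.0])
y = A @ x_true + 1e-4 * U0[:, 1]

x_tsvd = tsvd_solve(A, y, threshold=1e-3)  # drops the 1e-6 direction
```

Unlike ridge, which shrinks every component smoothly, TSVD makes a hard cut: directions below the threshold are discarded entirely.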