analysis of different L2 norms for regularization

In machine learning, we often use the L2 norm to prevent the weight vector from becoming too "big" according to this norm, and thus to generalize better from the training dataset. However, it is also possible to use norms that are close to L2, that is, norms in which each component is weighted differently according to some criterion. For instance: "put a higher penalty weight on components of the weight vector that appear rarely", or other ideas of this kind. Are there any articles/studies about this, i.e., using a tweaked L2 penalty and analyzing the results?

Suppose you want to minimize a function $f$, but you know that the system is ill-conditioned. Then you might use a regularization function to make the problem more stable. So instead of minimizing $f$, you solve \begin{align} \text{minimize} \hspace{8pt} f(x) + \lambda \, R(x) \end{align} where $R$ is a regularization function and $\lambda$ is the regularization parameter.
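To make this concrete, here is a minimal sketch assuming a least-squares data-fit term $f(x)=\|Ax-b\|_2^2$ and $R(x)=\|x\|_2^2$; the names `A`, `b`, and `lam` are illustrative, and the random data is only for demonstration:

```python
import numpy as np

# Minimize f(x) + lam * R(x) with f(x) = ||A x - b||^2 and R(x) = ||x||^2.
# For this choice the minimizer has a closed form:
#   x* = (A^T A + lam * I)^{-1} A^T b
def regularized_least_squares(A, b, lam):
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
x = regularized_least_squares(A, b, 1.0)
```

Even when $A^\mathsf{T} A$ is nearly singular, adding $\lambda I$ makes the linear system well-conditioned, which is exactly the stabilizing effect described above.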

You want to put as much information into your regularization as you can. When you use $R=\|\cdot\|_2$ (the L2 norm) as the regularization function, what you're saying is that you know all of the values should be small. How small? That's what the regularization parameter is for. If $\lambda$ is large, then you're saying that you want all of the values to be very small. If $\lambda$ is small, then you're saying that you want all of the values to be somewhat small.
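You can see the effect of $\lambda$ numerically. This is an illustrative sketch on random data (the solution norm shrinks monotonically as $\lambda$ grows):

```python
import numpy as np

# Closed-form ridge solution for ||A x - b||^2 + lam * ||x||^2.
def ridge(A, b, lam):
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

rng = np.random.default_rng(1)
A = rng.normal(size=(30, 4))
b = rng.normal(size=30)

# Larger lam expresses "I want the values very small": the norm of the
# minimizer decreases as lam increases.
norms = [np.linalg.norm(ridge(A, b, lam)) for lam in (0.01, 1.0, 100.0)]
```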

Using the $L2$ norm as a regularization function in this way gives you the optimal estimate when the noise in the system is Gaussian. Bayesian probabilists relate this to the prior distribution of your variable, which quantifies what you know about your variable prior to solving the problem.
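To spell out that Bayesian connection, assume (as an illustrative setup) a linear model $b = Ax + \varepsilon$ with Gaussian noise $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ and a Gaussian prior $x \sim \mathcal{N}(0, \tau^2 I)$. The MAP estimate minimizes the negative log-posterior: \begin{align} -\log p(x \mid b) &= -\log p(b \mid x) - \log p(x) + \text{const} \\ &= \frac{1}{2\sigma^2}\|Ax - b\|_2^2 + \frac{1}{2\tau^2}\|x\|_2^2 + \text{const}, \end{align} which is exactly the L2-regularized problem with $\lambda = \sigma^2/\tau^2$: a tighter prior (smaller $\tau$) means a larger $\lambda$.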

But what if you know more? What if you know that the values should be small, but the first value should be 10 times smaller than the last? In that case, you use a weighted norm: weight the first component by 10 while all the other components are weighted by 1. This takes advantage of the additional information you have.
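A sketch of that weighted penalty, using $R(x) = \|Wx\|_2^2$ with a diagonal $W$ (the data and weights below are illustrative):

```python
import numpy as np

# Weighted L2 penalty: R(x) = ||W x||^2 with W diagonal.
# Putting weight 10 on the first component (1 elsewhere) shrinks
# that component harder than a plain L2 penalty would.
def weighted_ridge(A, b, lam, w):
    W = np.diag(w)
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * (W.T @ W), A.T @ b)

rng = np.random.default_rng(2)
A = rng.normal(size=(50, 3))
b = A @ np.ones(3) + 0.1 * rng.normal(size=50)

x_weighted = weighted_ridge(A, b, 1.0, np.array([10.0, 1.0, 1.0]))
x_plain = weighted_ridge(A, b, 1.0, np.ones(3))
# The heavily weighted first component ends up much closer to zero.
```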

What if you know that the values should be small, but occasionally there will be large values? Then you'd set $R=\|\cdot\|_1$, the L1 norm, which corresponds to the optimal estimate when the noise is Laplacian.

Boyd and Vandenberghe have a great discussion of this in their book Convex Optimization, which you can get on the internet for free. If you have access to papers, an example of how a weighted norm can be used to good effect can be seen in this paper: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=7134771. (If you don't have easy access to papers, don't worry about this article. It wouldn't be that useful, just an example.)