L1 norm and L2 norm

I was studying Stephen Boyd's textbook on convex optimization. It says the following:

The amplitude distribution of the optimal residual for the l1-norm approximation problem will tend to have more zero and very small residuals , compared to the l2-norm approximation solution. In contrast, the l2-norm solution will tend to have relatively fewer large residuals (since large residuals incur a much larger penalty in l2-norm approximation than in l1-norm approximation).

I understand why the second sentence holds -- obviously, the l2 norm places a higher penalty on large residuals and hence yields fewer of them. But I can't understand the first sentence. The l1 norm places a higher penalty on residuals between 0 and 1 than the l2 norm does, so it seems to me that the l2 norm should yield more small residuals. Can anybody explain why l1-norm approximation generates more small residuals than l2-norm approximation?

In fact, the two statements sound contradictory to each other: if the l2 norm generates fewer large residuals, it sounds like it should generate more small residuals than the l1 norm.


Let me highlight the parts of the sentence that should be grouped together:

The amplitude distribution of the optimal residual for the l1-norm approximation problem will tend to have more (zero and very small residuals), compared to the l2-norm approximation solution. In contrast, the l2-norm solution will tend to have relatively fewer (large residuals) (since large residuals incur a much larger penalty in l2-norm approximation than in l1-norm approximation).

This doesn't mean that you won't see large residuals in l1-norm problems (you have to kind of read between the lines). This means that minimizing l1 error will tend to produce solutions that have:

  • a few residuals that are larger and
  • lots of very insignificant residuals.

In other words, the distribution of residuals will be very "spiky." (This is good, for example, when you want to be robust to outliers -- this method "lets" you have a few large residuals (i.e., large errors) while keeping most of the errors small.)
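Here is one way to see why minimizing l1 error pushes many residuals exactly to zero (a sketch of the standard subgradient intuition, not a quote from the book). Compare the marginal penalty of a small residual $r$ under each norm:

```latex
\frac{d}{dr}\,|r| = \pm 1
\qquad \text{vs.} \qquad
\frac{d}{dr}\,r^2 = 2r .
```

As $r \to 0$, the l1 penalty keeps decreasing at the constant rate $1$, so the optimizer always gains by pushing an already-small residual the rest of the way to zero. The l2 penalty's slope $2r$ vanishes near zero, so there is almost nothing to gain from shrinking a small residual further -- the optimizer does better by spreading error across many small (but nonzero) residuals.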

Minimizing l2 error, on the other hand, will produce:

  • very few big residuals, because they're penalized a lot more,
  • but at the cost of having lots more small residuals that are still significant.

In other words, the distribution of residuals will be far less "spiky" and more "even." (This is good when you have no outliers and you want to keep the overall error small -- it will produce a better "fit.")
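You can see this contrast in the simplest possible approximation problem: fitting a single constant $c$ to data. The l2-optimal $c$ is the mean and the l1-optimal $c$ is the median (this toy setup and the threshold 0.25 below are my choices for illustration, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
# 95 points clustered near 0, plus 5 outliers near 10.
data = np.concatenate([rng.normal(0.0, 0.1, 95),
                       rng.normal(10.0, 0.1, 5)])

c_l2 = np.mean(data)    # minimizes the sum of squared residuals
c_l1 = np.median(data)  # minimizes the sum of absolute residuals

r_l2 = data - c_l2
r_l1 = data - c_l1

# Count "very small" residuals (|r| < 0.25) under each fit.
small_l2 = int(np.sum(np.abs(r_l2) < 0.25))
small_l1 = int(np.sum(np.abs(r_l1) < 0.25))
print("small residuals, l1 fit:", small_l1)
print("small residuals, l2 fit:", small_l2)
```

The median sits inside the main cluster, so the l1 fit leaves almost all 95 bulk residuals tiny while the 5 outliers get large residuals -- a "spiky" distribution. The mean is dragged toward the outliers, so the l2 fit has a slightly smaller worst-case residual but leaves the whole bulk with residuals around 0.5: more "even," and far fewer near-zero residuals.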