I was studying Stephen Boyd's textbook on convex optimization. It says the following:
The amplitude distribution of the optimal residual for the l1-norm approximation problem will tend to have more zero and very small residuals, compared to the l2-norm approximation solution. In contrast, the l2-norm solution will tend to have relatively fewer large residuals (since large residuals incur a much larger penalty in l2-norm approximation than in l1-norm approximation).
I understand why the second sentence holds -- obviously, the l2-norm places a higher penalty on large residuals and hence would yield fewer of them. But I can't understand the first sentence. The l1-norm places a higher penalty on residuals between 0 and 1 than the l2-norm does, so it seems to me that the l2-norm should yield more small residuals. Can anybody explain why the l1-norm generates more small residuals than the l2-norm?
In fact, the two statements sound contradictory to each other. If the l2-norm generates fewer large residuals, it sounds like it should generate more small residuals than the l1-norm.
The key is how the first sentence should be parsed: "more (zero and very small) residuals" -- that is, the l1 solution tends to have more residuals that are exactly zero or very small, not more small-to-medium residuals across the board.
This doesn't mean that you won't see large residuals in l1-norm problems (you have to kind of read between the lines). It means that minimizing l1 error will tend to produce solutions that have:

- many residuals that are exactly zero or very small, and
- a few residuals that may be quite large.
In other words, the distribution of residuals will be very "spiky." (This is good, for example, when you want to be robust to outliers -- this method "lets" you have a few large residuals (i.e., large errors) while keeping most of the errors small.)
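Here is a minimal sketch of why the l2-norm works against spiky residuals (the residual values are made up for illustration): for a fixed total amount of residual, the l1 cost is indifferent between spreading it evenly and concentrating it in one residual, while the l2 cost strongly prefers the even split.

```python
# Two ways to distribute a total residual of 2 across two data points.
even = [1.0, 1.0]   # residual spread evenly
spiky = [2.0, 0.0]  # residual concentrated in one point (the other is zero)

def l1(r):
    # l1 cost: sum of absolute residuals
    return sum(abs(x) for x in r)

def l2(r):
    # l2 cost: sum of squared residuals
    return sum(x * x for x in r)

print(l1(even), l1(spiky))  # same l1 cost: l1 tolerates the spiky split
print(l2(even), l2(spiky))  # l2 cost doubles: l2 penalizes the spiky split
```

So, once the fit constraints force some total amount of error, the l1 objective has no reason to avoid solutions with exact zeros plus a few large residuals, while the l2 objective actively smooths the error out.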
Minimizing l2 error, on the other hand, will tend to produce solutions that have:

- few residuals that are exactly zero, and
- few very large residuals (the error is spread more evenly across all points).
In other words, the distribution of residuals will be far less "spiky" and more "even." (This is good when you have no outliers and you want to keep the overall error small -- it will produce a better "fit.")
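The simplest concrete case is fitting a single constant c to data: minimizing the l2 error gives the mean, while minimizing the l1 error gives the median. A small sketch with made-up data containing one outlier shows the two residual distributions:

```python
import numpy as np

# Hypothetical data: four well-behaved points and one outlier
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# l2 fit: argmin_c sum((y_i - c)^2) is the mean
c_l2 = y.mean()       # 22.0, dragged toward the outlier

# l1 fit: argmin_c sum(|y_i - c|) is the median
c_l1 = np.median(y)   # 3.0, ignores the outlier's magnitude

r_l2 = y - c_l2  # [-21., -20., -19., -18., 78.] -- no zeros, error spread out
r_l1 = y - c_l1  # [ -2.,  -1.,   0.,   1., 97.] -- one exact zero, small
                 # residuals, and a single large one: a "spiky" distribution
```

The l1 residuals are exactly the pattern Boyd describes: more zero and very small residuals, at the price of one large residual; the l2 residuals avoid the single huge error by distributing it over every point.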