Differentiable Approximation of the $ {L}_{1} $ Regularization


In machine learning we are often faced with optimization problems where we want to minimize some energy function with L1 regularization over some of the parameters, e.g.: $$ E(a,w) = [\text{sum of squared errors}] + \lambda||a||_1, $$ where $a$ and $w$ are vectors of parameters.

If we take the standard L1 norm definition $||a||_1=\sum_i|a_i|$, then the optimization is complicated because this norm is not differentiable wherever any $a_i = 0$.

Is there a differentiable replacement for the L1 norm?


There are 3 best solutions below

---

I would disagree with the statement that the optimization is complicated: many pre-existing routines exist to solve $l^1$-regularized least squares. Just to name one, take a look at FISTA (or just search for 'iterative shrinkage thresholding').
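For concreteness, here is a minimal ISTA sketch in Python/NumPy (my own illustration of the shrinkage-thresholding idea, not code taken from any FISTA package):

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: shrink each entry toward zero by t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, num_iter=500):
    # Iterative shrinkage-thresholding for (1/2)||Ax - b||^2 + lam * ||x||_1.
    L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the smooth part's gradient
    x = np.zeros(A.shape[1])
    for _ in range(num_iter):
        x = soft_threshold(x - (A.T @ (A @ x - b)) / L, lam / L)
    return x

# With A = I the minimizer is exactly the soft-thresholding of b.
b = np.array([2.0, -0.3, 1.0])
x = ista(np.eye(3), b, lam=0.5)  # -> [1.5, 0.0, 0.5]
```

Note that no differentiability of $\|x\|_1$ is needed: the non-smooth term is handled through its proximal operator rather than its gradient.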

If you're really looking for something differentiable, just consider

$$ \|a\|_{1+\epsilon} = \left(\sum_i \vert a_i\vert^{1+\epsilon}\right)^{\frac{1}{1+\epsilon}} $$ This is a slightly smoothed version of $\|a\|_1$ (differentiable everywhere except at the origin). This offers no advantage over standard shrinkage methods, though.
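Numerically (a small Python/NumPy sketch, function name my own), the $(1+\epsilon)$-norm sits just below $\|a\|_1$ and approaches it as $\epsilon \to 0$, since $p$-norms are non-increasing in $p$:

```python
import numpy as np

def p_norm(a, eps):
    # ||a||_{1+eps} = (sum_i |a_i|^(1+eps))^(1/(1+eps))
    p = 1.0 + eps
    return np.sum(np.abs(a) ** p) ** (1.0 / p)

a = np.array([1.0, -2.0, 0.5])
l1 = np.sum(np.abs(a))                    # ||a||_1 = 3.5
assert p_norm(a, 0.1) <= l1               # ||a||_p <= ||a||_1 for p >= 1
assert abs(p_norm(a, 1e-6) - l1) < 1e-4   # tight as eps -> 0
```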

---

Here are two smooth approximations, both upper bounds on $\|x\|_1$:

  • For every $\epsilon > 0$, $\|x\|_1 \le \sum_{i=1}^n \sqrt{x_i^2 + \epsilon^2}$, with equality in the limit $\epsilon \rightarrow 0^+$.
  • For every $\gamma_1,\ldots,\gamma_n > 0$, one has $\|x\|_1 \le \frac{1}{2}\sum_{i=1}^n \left(\gamma_i x_i^2 + \frac{1}{\gamma_i}\right)$, with equality if $\gamma_i = 1/|x_i|$ for all $i=1,\ldots,n$.
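Both bounds are easy to verify numerically; a short Python/NumPy check (the second bound is just the AM-GM inequality applied coordinatewise):

```python
import numpy as np

x = np.array([1.0, -2.0, 0.5])
l1 = np.sum(np.abs(x))

# First bound: sum_i sqrt(x_i^2 + eps^2) >= ||x||_1, tight as eps -> 0.
for eps in (1.0, 0.1, 1e-6):
    assert np.sum(np.sqrt(x**2 + eps**2)) >= l1
assert np.isclose(np.sum(np.sqrt(x**2 + 1e-8**2)), l1)

# Second bound: (1/2) sum_i (gamma_i x_i^2 + 1/gamma_i) >= ||x||_1,
# by AM-GM, with equality at gamma_i = 1/|x_i|.
gamma = 1.0 / np.abs(x)
assert np.isclose(0.5 * np.sum(gamma * x**2 + 1.0 / gamma), l1)
```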
---

What you're asking for is basically a smoothed version of the $ {L}_{1} $ norm.

The most common smoothing approximation uses the Huber loss function.
Its gradient is known, and replacing the $ {L}_{1} $ norm with it results in a smooth objective function to which you can apply gradient descent.
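As a sketch of what this means (Python/NumPy, my notation; this scaling of the Huber function matches `HuberLossGrad` in the MATLAB code below): $h_\mu(x) = x^2/(2\mu)$ for $|x| \le \mu$ and $|x| - \mu/2$ otherwise, whose derivative is continuous everywhere, unlike the subgradient of $|x|$ at zero:

```python
import numpy as np

def huber(x, mu):
    # Smooth surrogate for |x|: quadratic near zero, linear in the tails.
    return np.where(np.abs(x) <= mu, x**2 / (2 * mu), np.abs(x) - mu / 2)

def huber_grad(x, mu):
    # Derivative of huber: continuous, unlike sign(x) at x = 0.
    return np.where(np.abs(x) <= mu, x / mu, np.sign(x))

# Finite-difference check of the gradient at several points.
mu, h = 0.5, 1e-6
for x0 in (-1.0, -0.1, 0.0, 0.3, 2.0):
    fd = (huber(x0 + h, mu) - huber(x0 - h, mu)) / (2 * h)
    assert abs(fd - huber_grad(x0, mu)) < 1e-4
```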

Here is MATLAB code for that (validated against CVX):

function [ vX, mX ] = SolveLsL1Huber( mA, vB, paramLambda, numIterations )
% ----------------------------------------------------------------------------------------------- %
%[ vX, mX ] = SolveLsL1Huber( mA, vB, paramLambda, numIterations )
% Solve L1 Regularized Least Squares Using Smoothing (Huber Loss) Method.
% Input:
%   - mA                -   Input Matrix.
%                           The model matrix.
%                           Structure: Matrix (m X n).
%                           Type: 'Single' / 'Double'.
%                           Range: (-inf, inf).
%   - vB                -   Input Vector.
%                           The model known data.
%                           Structure: Vector (m X 1).
%                           Type: 'Single' / 'Double'.
%                           Range: (-inf, inf).
%   - paramLambda       -   Parameter Lambda.
%                           The L1 Regularization parameter.
%                           Structure: Scalar.
%                           Type: 'Single' / 'Double'.
%                           Range: (0, inf).
%   - numIterations     -   Number of Iterations.
%                           Number of iterations of the algorithm.
%                           Structure: Scalar.
%                           Type: 'Single' / 'Double'.
%                           Range {1, 2, ...}.
% Output:
%   - vX                -   Output Vector.
%                           Structure: Vector (n X 1).
%                           Type: 'Single' / 'Double'.
%                           Range: (-inf, inf).
%   - mX                -   Iterations Matrix.
%                           The estimate at each iteration (vX is its last column).
%                           Structure: Matrix (n X numIterations).
%                           Type: 'Single' / 'Double'.
%                           Range: (-inf, inf).
% References
%   1.  Huber Loss Wikipedia - https://en.wikipedia.org/wiki/Huber_loss.
% Remarks:
%   1.  As the smoothness term approaches zero the Huber Loss better
%       approximates the L1 Norm. Yet the lower the value, the harder the
%       problem is to solve, hence "Warm Start" is used.
% TODO:
%   1.  Add line search (Backtracking).
% Release Notes:
%   -   1.0.000     25/08/2017
%       *   First release version.
% ----------------------------------------------------------------------------------------------- %

mAA = mA.' * mA;
vAb = mA.' * vB;
vX  = pinv(mA) * vB; %<! Dealing with "Fat Matrix"

lipConst = norm(mA, 2) ^ 2; %<! Lipschitz Constant of the Least Squares gradient

paramMuBase     = 0.005; %<! Smoothness term in Huber Loss
stepSizeBase    = 1 / lipConst;

mX          = zeros(size(mA, 2), numIterations); %<! Preallocate the iterations history
mX(:, 1)    = vX;

for ii = 2:numIterations

    paramMu     = paramMuBase / log2(ii);
    stepSize    = stepSizeBase / log2(ii);

    vG = (mAA * vX) - vAb + (paramLambda * HuberLossGrad(vX, paramMu));
    vX = vX - (stepSize * vG);

    mX(:, ii) = vX;

end


end


function [ vG ] = HuberLossGrad( vX, paramMu )

vG = ((abs(vX) <= paramMu) .* (vX ./ paramMu)) + ((abs(vX) > paramMu) .* sign(vX));


end

The code above matches the form of L1 regularized Linear Least Squares:

$$ \frac{1}{2} \left\| A x - b \right\|_{2}^{2} + \lambda \left\| x \right\|_{1} $$

Yet you can easily adapt it to other forms.
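For reference, here is a rough Python/NumPy port of the same scheme (the function name `solve_ls_l1_huber` and the sanity check are my own; with $\lambda = 0$ it must reproduce plain least squares, since the Huber term vanishes):

```python
import numpy as np

def solve_ls_l1_huber(A, b, lam, num_iter, mu0=0.005):
    # Gradient descent on (1/2)||Ax - b||^2 + lam * sum_i huber_mu(x_i),
    # shrinking mu and the step size each iteration ("warm start"),
    # mirroring the MATLAB routine above.
    AtA, Atb = A.T @ A, A.T @ b
    x = np.linalg.pinv(A) @ b                  # handles fat matrices too
    step0 = 1.0 / np.linalg.norm(A, 2) ** 2    # 1 / Lipschitz constant
    for ii in range(2, num_iter + 1):
        mu = mu0 / np.log2(ii)
        step = step0 / np.log2(ii)
        g = np.where(np.abs(x) <= mu, x / mu, np.sign(x))  # Huber gradient
        x = x - step * (AtA @ x - Atb + lam * g)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

# Sanity check: with lam = 0 the iteration stays at the least squares solution.
x0 = solve_ls_l1_huber(A, b, lam=0.0, num_iter=200)
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x0, x_ls)
```

With $\lambda > 0$ the Huber term pulls the coefficients toward zero, and the result approaches the exact L1 solution as `mu0` shrinks.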