Differentiable Approximation of the $ {L}_{1} $ Regularization


In machine learning we are often faced with optimization problems where we want to minimize some energy function with L1 regularization over some of the parameters, e.g.: $$ E(a,w) = [\text{sum of squared errors}] + \lambda||a||_1, $$ where $a$ and $w$ are vectors of parameters.

If we take the standard L1 norm definition $||a||_1=\sum_i|a_i|$, then the optimization is complicated because this norm is not differentiable wherever any $a_i = 0$.

Is there a differentiable replacement for the L1 norm?


There are 3 best solutions below

---

I would disagree with the statement that the optimization is complicated: many pre-existing routines exist to solve $l^1$-regularized least squares. Just to name one, take a look at FISTA (or just search for 'iterative shrinkage thresholding').
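For concreteness, here is a minimal ISTA sketch in Python/NumPy (my own illustration of the shrinkage-thresholding idea, not code taken from any FISTA package):

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: shrink each entry toward zero by t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, num_iter=500):
    # Iterative shrinkage-thresholding for (1/2)||Ax - b||^2 + lam * ||x||_1.
    L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the smooth part's gradient
    x = np.zeros(A.shape[1])
    for _ in range(num_iter):
        x = soft_threshold(x - (A.T @ (A @ x - b)) / L, lam / L)
    return x

# With A = I the minimizer is exactly the soft-thresholding of b.
b = np.array([2.0, -0.3, 1.0])
x = ista(np.eye(3), b, lam=0.5)  # -> [1.5, 0.0, 0.5]
```

Note that no differentiability of $\|x\|_1$ is needed: the non-smooth term is handled through its proximal operator rather than its gradient.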

If you're really looking for something differentiable, just consider

$$ \|a\|_{1+\epsilon} = \left(\sum_i \vert a_i\vert^{1+\epsilon}\right)^{\frac{1}{1+\epsilon}} $$ This is a slightly smoothed version of $\|a\|_1$ (differentiable everywhere except at the origin). This offers no advantage over standard shrinkage methods, though.
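Numerically (a small Python/NumPy sketch, function name my own), the $(1+\epsilon)$-norm sits just below $\|a\|_1$ and approaches it as $\epsilon \to 0$, since $p$-norms are non-increasing in $p$:

```python
import numpy as np

def p_norm(a, eps):
    # ||a||_{1+eps} = (sum_i |a_i|^(1+eps))^(1/(1+eps))
    p = 1.0 + eps
    return np.sum(np.abs(a) ** p) ** (1.0 / p)

a = np.array([1.0, -2.0, 0.5])
l1 = np.sum(np.abs(a))                    # ||a||_1 = 3.5
assert p_norm(a, 0.1) <= l1               # ||a||_p <= ||a||_1 for p >= 1
assert abs(p_norm(a, 1e-6) - l1) < 1e-4   # tight as eps -> 0
```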

---

Here are two smooth approximations, both upper bounds on $\|x\|_1$:

  • For every $\epsilon > 0$, $\|x\|_1 \le \sum_{i=1}^n \sqrt{x_i^2 + \epsilon^2}$, with equality in the limit $\epsilon \rightarrow 0^+$.
  • For every $\gamma_1,\ldots,\gamma_n > 0$, one has $\|x\|_1 \le \frac{1}{2}\sum_{i=1}^n \left(\gamma_i x_i^2 + \frac{1}{\gamma_i}\right)$, with equality if $\gamma_i = 1/|x_i|$ for all $i=1,\ldots,n$.
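Both bounds are easy to verify numerically; a short Python/NumPy check (the second bound is just the AM-GM inequality applied coordinatewise):

```python
import numpy as np

x = np.array([1.0, -2.0, 0.5])
l1 = np.sum(np.abs(x))

# First bound: sum_i sqrt(x_i^2 + eps^2) >= ||x||_1, tight as eps -> 0.
for eps in (1.0, 0.1, 1e-6):
    assert np.sum(np.sqrt(x**2 + eps**2)) >= l1
assert np.isclose(np.sum(np.sqrt(x**2 + 1e-8**2)), l1)

# Second bound: (1/2) sum_i (gamma_i x_i^2 + 1/gamma_i) >= ||x||_1,
# by AM-GM, with equality at gamma_i = 1/|x_i|.
gamma = 1.0 / np.abs(x)
assert np.isclose(0.5 * np.sum(gamma * x**2 + 1.0 / gamma), l1)
```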
---

What you're asking for is basically a smoothed version of the $ {L}_{1} $ norm.

The most common smoothing approximation uses the Huber loss function.
Its gradient is known, and replacing the $ {L}_{1} $ norm with it results in a smooth objective function to which you can apply gradient descent.
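As a sketch of what this means (Python/NumPy, my notation; this scaling of the Huber function matches `HuberLossGrad` in the MATLAB code below): $h_\mu(x) = x^2/(2\mu)$ for $|x| \le \mu$ and $|x| - \mu/2$ otherwise, whose derivative is continuous everywhere, unlike the subgradient of $|x|$ at zero:

```python
import numpy as np

def huber(x, mu):
    # Smooth surrogate for |x|: quadratic near zero, linear in the tails.
    return np.where(np.abs(x) <= mu, x**2 / (2 * mu), np.abs(x) - mu / 2)

def huber_grad(x, mu):
    # Derivative of huber: continuous, unlike sign(x) at x = 0.
    return np.where(np.abs(x) <= mu, x / mu, np.sign(x))

# Finite-difference check of the gradient at several points.
mu, h = 0.5, 1e-6
for x0 in (-1.0, -0.1, 0.0, 0.3, 2.0):
    fd = (huber(x0 + h, mu) - huber(x0 - h, mu)) / (2 * h)
    assert abs(fd - huber_grad(x0, mu)) < 1e-4
```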

Here is MATLAB code for that (validated against CVX):

function [ vX, mX ] = SolveLsL1Huber( mA, vB, paramLambda, numIterations )
% ----------------------------------------------------------------------------------------------- %
%[ vX, mX ] = SolveLsL1Huber( mA, vB, paramLambda, numIterations )
% Solve L1 Regularized Least Squares Using Smoothing (Huber Loss) Method.
% Input:
%   - mA                -   Input Matrix.
%                           The model matrix.
%                           Structure: Matrix (m X n).
%                           Type: 'Single' / 'Double'.
%                           Range: (-inf, inf).
%   - vB                -   Input Vector.
%                           The model known data.
%                           Structure: Vector (m X 1).
%                           Type: 'Single' / 'Double'.
%                           Range: (-inf, inf).
%   - paramLambda       -   Parameter Lambda.
%                           The L1 Regularization parameter.
%                           Structure: Scalar.
%                           Type: 'Single' / 'Double'.
%                           Range: (0, inf).
%   - numIterations     -   Number of Iterations.
%                           Number of iterations of the algorithm.
%                           Structure: Scalar.
%                           Type: 'Single' / 'Double'.
%                           Range {1, 2, ...}.
% Output:
%   - vX                -   Output Vector.
%                           Structure: Vector (n X 1).
%                           Type: 'Single' / 'Double'.
%                           Range: (-inf, inf).
%   - mX                -   Iterations Matrix.
%                           The estimate at each iteration (vX is its last column).
%                           Structure: Matrix (n X numIterations).
%                           Type: 'Single' / 'Double'.
%                           Range: (-inf, inf).
% References
%   1.  Huber Loss Wikipedia - https://en.wikipedia.org/wiki/Huber_loss.
% Remarks:
%   1.  As the smoothness term approaches zero the Huber Loss better
%       approximates the L1 Norm. Yet the lower the value, the harder the
%       problem is to solve, hence "Warm Start" is used.
% TODO:
%   1.  Add line search (Backtracking).
% Release Notes:
%   -   1.0.000     25/08/2017
%       *   First release version.
% ----------------------------------------------------------------------------------------------- %

mAA = mA.' * mA;
vAb = mA.' * vB;
vX  = pinv(mA) * vB; %<! Dealing with "Fat Matrix"

lipConst = norm(mA, 2) ^ 2; %<! Lipschitz Constant of the Least Squares gradient

paramMuBase     = 0.005; %<! Smoothness term in Huber Loss
stepSizeBase    = 1 / lipConst;

mX          = zeros(size(mA, 2), numIterations); %<! Preallocate the iterations history
mX(:, 1)    = vX;

for ii = 2:numIterations

    paramMu     = paramMuBase / log2(ii);
    stepSize    = stepSizeBase / log2(ii);

    vG = (mAA * vX) - vAb + (paramLambda * HuberLossGrad(vX, paramMu));
    vX = vX - (stepSize * vG);

    mX(:, ii) = vX;

end


end


function [ vG ] = HuberLossGrad( vX, paramMu )

vG = ((abs(vX) <= paramMu) .* (vX ./ paramMu)) + ((abs(vX) > paramMu) .* sign(vX));


end

The code above matches the form of L1 regularized Linear Least Squares:

$$ \frac{1}{2} \left\| A x - b \right\|_{2}^{2} + \lambda \left\| x \right\|_{1} $$

Yet you can easily adapt it to other forms.
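For reference, here is a rough Python/NumPy port of the same scheme (the function name `solve_ls_l1_huber` and the sanity check are my own; with $\lambda = 0$ it must reproduce plain least squares, since the Huber term vanishes):

```python
import numpy as np

def solve_ls_l1_huber(A, b, lam, num_iter, mu0=0.005):
    # Gradient descent on (1/2)||Ax - b||^2 + lam * sum_i huber_mu(x_i),
    # shrinking mu and the step size each iteration ("warm start"),
    # mirroring the MATLAB routine above.
    AtA, Atb = A.T @ A, A.T @ b
    x = np.linalg.pinv(A) @ b                  # handles fat matrices too
    step0 = 1.0 / np.linalg.norm(A, 2) ** 2    # 1 / Lipschitz constant
    for ii in range(2, num_iter + 1):
        mu = mu0 / np.log2(ii)
        step = step0 / np.log2(ii)
        g = np.where(np.abs(x) <= mu, x / mu, np.sign(x))  # Huber gradient
        x = x - step * (AtA @ x - Atb + lam * g)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

# Sanity check: with lam = 0 the iteration stays at the least squares solution.
x0 = solve_ls_l1_huber(A, b, lam=0.0, num_iter=200)
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x0, x_ls)
```

With $\lambda > 0$ the Huber term pulls the coefficients toward zero, and the result approaches the exact L1 solution as `mu0` shrinks.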