I'm reading the *Deep Learning* book (Goodfellow et al., 2016), and on pages 231-232 they give a proof of how L1 regularization makes a model sparse. You can skip to the last two expressions for the actual question, but here is some context if you want it:
The regularized objective function $\tilde{J}(\boldsymbol{w} ; \boldsymbol{X}, \boldsymbol{y})$ is given by: $$ \tilde{J}(\boldsymbol{w} ; \boldsymbol{X}, \boldsymbol{y})=\alpha\|\boldsymbol{w}\|_{1}+J(\boldsymbol{w} ; \boldsymbol{X}, \boldsymbol{y})$$ where $\boldsymbol{w}$ is the parameter vector and $\boldsymbol{X}, \boldsymbol{y}$ are the design matrix (the inputs) and the outputs of the model, respectively.
If we make a second-order Taylor series approximation of the unregularized loss function around its minimum $\boldsymbol{w}^{*}$ (we assume the model is linear, so that the problem has a clean analytical solution), we have $$\hat{J}(\boldsymbol{w})=J\left(\boldsymbol{w}^{*}\right)+\frac{1}{2}\left(\boldsymbol{w}-\boldsymbol{w}^{*}\right)^{\top} \boldsymbol{H}\left(\boldsymbol{w}-\boldsymbol{w}^{*}\right)$$ where $\boldsymbol{H}$ is the Hessian matrix of $J$ with respect to $\boldsymbol{w}$ evaluated at $\boldsymbol{w}^{*}$. There is no first-order term in this quadratic approximation because $\boldsymbol{w}^{*}$ is defined to be a minimum, where the gradient vanishes.
The minimum of $\hat{J}$ occurs where its gradient $$\nabla_{\boldsymbol{w}} \hat{J}(\boldsymbol{w})=\boldsymbol{H}\left(\boldsymbol{w}-\boldsymbol{w}^{*}\right)$$ is equal to $\boldsymbol{0}$.
If we make the further assumption that the Hessian is diagonal, $\boldsymbol{H}=\operatorname{diag}\left(\left[H_{1,1}, \ldots, H_{n, n}\right]\right),$ where each $H_{i, i}>0$ (more details in the book), then our quadratic approximation of the L1-regularized objective function decomposes into a sum over the parameters: $$\hat{J}(\boldsymbol{w} ; \boldsymbol{X}, \boldsymbol{y})=J\left(\boldsymbol{w}^{*} ; \boldsymbol{X}, \boldsymbol{y}\right)+\sum_{i}\left[\frac{1}{2} H_{i, i}\left(w_{i}-w_{i}^{*}\right)^{2}+\alpha\left|w_{i}\right|\right]$$
Then they say (and this is the step I don't see, specifically where that max function comes from):
The problem of minimizing this approximate cost function has an analytical solution (for each dimension $i$), of the following form:
$$w_{i}=\operatorname{sign}\left(w_{i}^{*}\right) \max \left\{\left|w_{i}^{*}\right|-\frac{\alpha}{H_{i, i}}, 0\right\}$$
How did they get that expression? I've reached a similar one, but without that max function...
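In case it helps, a quick numerical sanity check (with made-up values for $w_i^*$, $H_{i,i}$, and $\alpha$) does confirm the book's formula against a brute-force grid minimization of the per-dimension objective:

```python
import numpy as np

def analytical_min(w_star, h, alpha):
    """Closed-form minimizer from the book (soft thresholding)."""
    return np.sign(w_star) * max(abs(w_star) - alpha / h, 0.0)

def objective(w, w_star, h, alpha):
    """Per-dimension regularized quadratic: 0.5*H*(w - w*)^2 + alpha*|w|."""
    return 0.5 * h * (w - w_star) ** 2 + alpha * np.abs(w)

# Illustrative values for (w*, H_ii, alpha); the middle case has
# |w*| <= alpha/H_ii, so the minimizer should be clipped to exactly 0.
for w_star, h, alpha in [(2.0, 1.0, 0.5), (0.3, 1.0, 0.5), (-1.5, 2.0, 1.0)]:
    grid = np.linspace(-5, 5, 2_000_001)  # fine grid that contains 0
    w_grid = grid[np.argmin(objective(grid, w_star, h, alpha))]
    w_closed = analytical_min(w_star, h, alpha)
    print(w_star, h, alpha, w_closed, w_grid)
    assert abs(w_closed - w_grid) < 1e-5
```

The two minimizers agree in every case, including the clipped-to-zero one, so the max really is part of the exact solution.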
For further insights, please refer to the book.
You are seeking to minimize $g(w_i) := \frac{1}{2} H_{ii} (w_i - w^*_i)^2 + \alpha |w_i|$.
Approach using the theory of subdifferentials:
Let $f(x) = |x|$. The minimizer $\hat{w}_i$ must satisfy $$0 \in \partial g(\hat{w}_i) = H_{ii}(\hat{w}_i - w^*_i) + \alpha \partial f(\hat{w}_i)$$ where $$\partial f(x) = \begin{cases}\{1\} & x > 0 \\ [-1, 1] & x = 0 \\ \{-1\} & x < 0\end{cases}$$ We can rearrange the condition $0 \in \partial g(\hat{w}_i)$ as \begin{align} -\frac{H_{ii}}{\alpha} (\hat{w}_i - w_i^*) \in \begin{cases}\{1\} & \hat{w}_i > 0 \\ [-1, 1] & \hat{w}_i = 0 \\ \{-1\} & \hat{w}_i < 0\end{cases} \end{align} Since $g$ is strictly convex (because $H_{ii} > 0$), any point satisfying this condition is the unique minimizer. You can check that the given solution satisfies the condition by considering the cases $w_i^* > \alpha/H_{ii}$, $|w_i^*| \le \alpha/H_{ii}$, and $w_i^* < -\alpha/H_{ii}$.
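To spell out where the max comes from, here is a sketch of that case check, one branch of the subdifferential at a time:

\begin{align}
\hat{w}_i > 0 &: \quad H_{ii}(\hat{w}_i - w_i^*) + \alpha = 0 \;\Rightarrow\; \hat{w}_i = w_i^* - \frac{\alpha}{H_{ii}}, \text{ consistent with } \hat{w}_i > 0 \text{ iff } w_i^* > \frac{\alpha}{H_{ii}} \\
\hat{w}_i < 0 &: \quad H_{ii}(\hat{w}_i - w_i^*) - \alpha = 0 \;\Rightarrow\; \hat{w}_i = w_i^* + \frac{\alpha}{H_{ii}}, \text{ consistent with } \hat{w}_i < 0 \text{ iff } w_i^* < -\frac{\alpha}{H_{ii}} \\
\hat{w}_i = 0 &: \quad \frac{H_{ii}}{\alpha} w_i^* \in [-1, 1] \;\Leftrightarrow\; |w_i^*| \le \frac{\alpha}{H_{ii}}
\end{align}

These three cases are exhaustive, and they combine into the single expression $w_i = \operatorname{sign}(w_i^*) \max\left\{|w_i^*| - \frac{\alpha}{H_{ii}}, 0\right\}$: whenever $|w_i^*| \le \alpha/H_{ii}$, the max clips the solution to exactly zero, which is precisely the sparsity effect of L1 regularization.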