I'm reading the *Deep Learning* book (Goodfellow et al., 2016), and on pages 231-232 they give a proof of how L1 regularization makes a model sparse. You can skip to the last two expressions for the actual question, but here is some context if you want it:
The regularized objective function $\tilde{J}(\boldsymbol{w} ; \boldsymbol{X}, \boldsymbol{y})$ is given by: $$ \tilde{J}(\boldsymbol{w} ; \boldsymbol{X}, \boldsymbol{y})=\alpha\|\boldsymbol{w}\|_{1}+J(\boldsymbol{w} ; \boldsymbol{X}, \boldsymbol{y})$$ where $\boldsymbol{w}$ is the parameter vector and $\boldsymbol{X}, \boldsymbol{y}$ are the design matrix (the inputs) and the outputs of the model, respectively.
If we make a second-order Taylor series approximation of the unregularized loss function around its minimum $\boldsymbol{w}^{*}$ (we assume the model is linear, so that the problem has a clean analytical solution), we have $$\hat{J}(\boldsymbol{w})=J\left(\boldsymbol{w}^{*}\right)+\frac{1}{2}\left(\boldsymbol{w}-\boldsymbol{w}^{*}\right)^{\top} \boldsymbol{H}\left(\boldsymbol{w}-\boldsymbol{w}^{*}\right)$$ where $\boldsymbol{H}$ is the Hessian matrix of $J$ with respect to $\boldsymbol{w}$ evaluated at $\boldsymbol{w}^{*}$. There is no first-order term in this quadratic approximation because $\boldsymbol{w}^{*}$ is defined to be a minimum, where the gradient vanishes.
The minimum of $\hat{J}$ occurs where its gradient $$\nabla_{\boldsymbol{w}} \hat{J}(\boldsymbol{w})=\boldsymbol{H}\left(\boldsymbol{w}-\boldsymbol{w}^{*}\right)$$ is equal to $\boldsymbol{0}$.
If we make the further assumption that the Hessian is diagonal, $\boldsymbol{H}=\operatorname{diag}\left(\left[H_{1,1}, \ldots, H_{n, n}\right]\right),$ where each $H_{i, i}>0$ (more details in the book), then our quadratic approximation of the L1-regularized objective function decomposes into a sum over the parameters: $$\hat{J}(\boldsymbol{w} ; \boldsymbol{X}, \boldsymbol{y})=J\left(\boldsymbol{w}^{*} ; \boldsymbol{X}, \boldsymbol{y}\right)+\sum_{i}\left[\frac{1}{2} H_{i, i}\left(w_{i}-w_{i}^{*}\right)^{2}+\alpha\left|w_{i}\right|\right]$$
Then they say (and this is the step I don't see, specifically where that max function comes from):
The problem of minimizing this approximate cost function has an analytical solution (for each dimension $i$), of the following form:
$$w_{i}=\operatorname{sign}\left(w_{i}^{*}\right) \max \left\{\left|w_{i}^{*}\right|-\frac{\alpha}{H_{i, i}}, 0\right\}$$
How did they get that expression? I've reached a similar one, but without that max function...
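In case it helps, a quick numerical sanity check (with made-up values for $w_i^*$, $H_{i,i}$, and $\alpha$) does confirm the book's formula against a brute-force grid minimization of the per-dimension objective:

```python
import numpy as np

def analytical_min(w_star, h, alpha):
    """Closed-form minimizer from the book (soft thresholding)."""
    return np.sign(w_star) * max(abs(w_star) - alpha / h, 0.0)

def objective(w, w_star, h, alpha):
    """Per-dimension regularized quadratic: 0.5*H*(w - w*)^2 + alpha*|w|."""
    return 0.5 * h * (w - w_star) ** 2 + alpha * np.abs(w)

# Illustrative values for (w*, H_ii, alpha); the middle case has
# |w*| <= alpha/H_ii, so the minimizer should be clipped to exactly 0.
for w_star, h, alpha in [(2.0, 1.0, 0.5), (0.3, 1.0, 0.5), (-1.5, 2.0, 1.0)]:
    grid = np.linspace(-5, 5, 2_000_001)  # fine grid that contains 0
    w_grid = grid[np.argmin(objective(grid, w_star, h, alpha))]
    w_closed = analytical_min(w_star, h, alpha)
    print(w_star, h, alpha, w_closed, w_grid)
    assert abs(w_closed - w_grid) < 1e-5
```

The two minimizers agree in every case, including the clipped-to-zero one, so the max really is part of the exact solution.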
For further insights, please refer to the book.
You are seeking to minimize $g(w_i) := \frac{1}{2} H_{ii} (w_i - w^*_i)^2 + \alpha |w_i|$.
Approach using the theory of subdifferentials:
Let $f(x) = |x|$. The minimizer $\hat{w}_i$ must satisfy $$0 \in \partial g(\hat{w}_i) = H_{ii}(\hat{w}_i - w^*_i) + \alpha \partial f(\hat{w}_i)$$ where $$\partial f(x) = \begin{cases}\{1\} & x > 0 \\ [-1, 1] & x = 0 \\ \{-1\} & x < 0\end{cases}$$ We can rearrange the condition $0 \in \partial g(\hat{w}_i)$ as \begin{align} -\frac{H_{ii}}{\alpha} (\hat{w}_i - w_i^*) \in \begin{cases}\{1\} & \hat{w}_i > 0 \\ [-1, 1] & \hat{w}_i = 0 \\ \{-1\} & \hat{w}_i < 0\end{cases} \end{align} Since $g$ is strictly convex (because $H_{ii} > 0$), any point satisfying this condition is the unique minimizer. You can check that the given solution satisfies the condition by considering the cases $w_i^* > \alpha/H_{ii}$, $|w_i^*| \le \alpha/H_{ii}$, and $w_i^* < -\alpha/H_{ii}$.
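To spell out where the max comes from, here is a sketch of that case check, one branch of the subdifferential at a time:

\begin{align}
\hat{w}_i > 0 &: \quad H_{ii}(\hat{w}_i - w_i^*) + \alpha = 0 \;\Rightarrow\; \hat{w}_i = w_i^* - \frac{\alpha}{H_{ii}}, \text{ consistent with } \hat{w}_i > 0 \text{ iff } w_i^* > \frac{\alpha}{H_{ii}} \\
\hat{w}_i < 0 &: \quad H_{ii}(\hat{w}_i - w_i^*) - \alpha = 0 \;\Rightarrow\; \hat{w}_i = w_i^* + \frac{\alpha}{H_{ii}}, \text{ consistent with } \hat{w}_i < 0 \text{ iff } w_i^* < -\frac{\alpha}{H_{ii}} \\
\hat{w}_i = 0 &: \quad \frac{H_{ii}}{\alpha} w_i^* \in [-1, 1] \;\Leftrightarrow\; |w_i^*| \le \frac{\alpha}{H_{ii}}
\end{align}

These three cases are exhaustive, and they combine into the single expression $w_i = \operatorname{sign}(w_i^*) \max\left\{|w_i^*| - \frac{\alpha}{H_{ii}}, 0\right\}$: whenever $|w_i^*| \le \alpha/H_{ii}$, the max clips the solution to exactly zero, which is precisely the sparsity effect of L1 regularization.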