In the sparse group lasso paper SGLpaper, they derive a condition for $\beta^{(k)}=0$ from the subgradient equation $$ \frac{1}{n} {X^{(k)}}^T\Big(y-\sum_{l=1}^m X^{(l)} \beta^{(l)}\Big) = (1-\alpha)\lambda u + \alpha \lambda v, $$ where $m$ is the number of groups. As in the linear regression model, $X$ is the $n \times p$ design matrix, $X^{(l)}$ is the submatrix of $X$ for group $l$ (an $n \times p_l$ matrix, where $p_l$ is the number of covariates in group $l$), $y$ is the $n$-vector of responses, and $\beta$ is the coefficient vector to be estimated; $\lambda \geq 0$ is the penalty parameter and $\alpha \in [0,1]$ the mixing parameter. Here $u$ and $v$ are subgradients of $\|\beta^{(k)}\|_2$ and $\|\beta^{(k)}\|_1$, respectively, evaluated at $\beta^{(k)}$: $$ u \in \{u : \|u\|_2 \leq 1\} \quad \text{if } \beta^{(k)}=0,$$ $$ v_j \in \{v_j : |v_j| \leq 1\} \quad \text{if } \beta_j^{(k)}=0.$$ They say: "With a little bit of algebra, we see that the subgradient equations are satisfied with $\beta^{(k)}=0$ if $$ \|S({X^{(k)}}^T r_{(-k)}/n,\, \alpha \lambda)\|_2 \leq (1-\alpha)\lambda,$$" where $r_{(-k)}$ is the partial residual of $y$, subtracting all group fits other than group $k$: $$r_{(-k)}=y-\sum_{l \neq k} X^{(l)} \beta^{(l)},$$ and $S(\cdot)$ is the coordinate-wise soft-thresholding operator: $$(S(z, \alpha \lambda))_j = \operatorname{sign}(z_j)(|z_j| - \alpha \lambda)_+ .$$
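For concreteness, here is a minimal numeric sketch of the paper's group-screening condition; the function names `soft_threshold` and `group_is_zero` are my own, not from the paper:

```python
import numpy as np

def soft_threshold(z, t):
    """Coordinate-wise soft thresholding: (S(z, t))_j = sign(z_j) * (|z_j| - t)_+."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def group_is_zero(X_k, r_minus_k, n, alpha, lam):
    """Check the condition for beta^(k) = 0:
    ||S(X^(k)^T r_(-k) / n, alpha*lam)||_2 <= (1 - alpha)*lam."""
    z = X_k.T @ r_minus_k / n
    return np.linalg.norm(soft_threshold(z, alpha * lam)) <= (1 - alpha) * lam
```

If this condition holds, the whole group $k$ is set to zero; otherwise its coefficients are fit.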
I tried to derive the same inequality, but I ended up with these: $$\left\|\frac{1}{n} {X^{(k)}}^T r_{(-k)} - \alpha \lambda\, [-1,1]\right\|_2 \leq (1-\alpha)\lambda $$
$$\left\|\, \left|\frac{1}{n} {X^{(k)}}^T r_{(-k)}\right| - \alpha \lambda\, \right\|_2 \leq (1-\alpha)\lambda $$ But I couldn't figure out how to arrive at the soft-thresholding operator.
I would really appreciate help with this. Thanks very much.
You already found $$\left\|\frac{1}{n} {X^{(k)}}^T r_{(-k)} - \alpha \lambda\, [-1,1]\right\|_2 \leq (1-\alpha)\lambda, \qquad (1) $$ where, for each element of the vector separately, you can choose any value in $[-1,1]$ to satisfy the inequality. Let us see how you would select this value for the $j^{th}$ component. The goal is to satisfy inequality (1), so you want the component to be as close to $0$ as possible. The value you select depends on $$\frac{1}{n}\left({X^{(k)}}^T r_{(-k)}\right)_j. \qquad (2)$$ You select $1$ (from the interval $[-1,1]$) when (2) is larger than $\alpha\lambda$, $-1$ when (2) is smaller than $-\alpha\lambda$, and (2) divided by $\alpha\lambda$ when (2) is in $[-\alpha\lambda,\alpha\lambda]$. The last displayed formula in your question does not take this last case into account, while the $(\cdot)_+$ operator does.
When (2) is larger than $\alpha\lambda$, the $j^{th}$ component equals: $$\frac{1}{n}\left({X^{(k)}}^T r_{(-k)}\right)_j-\alpha\lambda.$$ When (2) is smaller than $-\alpha\lambda$, the $j^{th}$ component equals: $$\frac{1}{n}\left({X^{(k)}}^T r_{(-k)}\right)_j+\alpha\lambda.$$ When (2) is in $[-\alpha\lambda,\alpha\lambda]$, the $j^{th}$ component equals $0$. In each of these cases, the component equals $$\operatorname{sign}\left(\frac{1}{n}\left({X^{(k)}}^T r_{(-k)}\right)_j\right)\left(\left|\frac{1}{n}\left({X^{(k)}}^T r_{(-k)}\right)_j\right|- \alpha \lambda\right)_+.$$ Note that the sign operator can be omitted, since it does not affect the value of the 2-norm in (1).
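The three-case argument above can be checked numerically: for each component $z_j$, the optimal $v_j \in [-1,1]$ is the clipped value $z_j/(\alpha\lambda)$, and the resulting residual $z_j - \alpha\lambda v_j$ coincides with the soft-thresholded value. This is an illustrative sketch, not code from the paper:

```python
import numpy as np

alpha_lam = 0.8
# Writing z_j for (1/n)(X^(k)^T r_(-k))_j; values chosen to cover all three cases.
z = np.array([1.5, -2.0, 0.3, -0.5, 0.0])

# Optimal choice of v_j in [-1, 1]: clip z_j / (alpha*lam) to the interval.
v_opt = np.clip(z / alpha_lam, -1.0, 1.0)
residual = z - alpha_lam * v_opt

# Coordinate-wise soft thresholding: sign(z_j) * (|z_j| - alpha*lam)_+.
soft = np.sign(z) * np.maximum(np.abs(z) - alpha_lam, 0.0)

print(np.allclose(residual, soft))  # prints True: the two expressions coincide
```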