Single predictor soft thresholding


I found the following problem in this book: $$\underset{\beta}{\mathrm{minimize}}\left\{\frac{1}{2N}\sum\limits_{i=1}^N(y_i-z_i\beta)^2+\lambda|\beta|\right\}$$ for a parameter $\lambda>0$, known real sequences $y_i$, $z_i$ (with the predictor standardized so that $\frac1N\sum_{i=1}^N z_i^2=1$), and known $N\in\mathbb{N}$. My difficulties are the non-differentiability at zero and the fact that the problem cannot be decomposed into $N$ separate minimizations, since $\beta$ is common to all measurements. The book proposes (by inspection) the following optimal solution: $$ \hat{\beta}= \left\{ \begin{array}{ll} \frac{1}{N}\langle\mathbf{z},\mathbf{y}\rangle-\lambda,&\quad\frac{1}{N}\langle{\mathbf{z}},\mathbf{y}\rangle>\lambda\\ 0,&\quad\left|\frac{1}{N}\langle{\mathbf{z}},\mathbf{y}\rangle\right|\leq\lambda\\ \frac{1}{N}\langle\mathbf{z},\mathbf{y}\rangle+\lambda,&\quad\frac{1}{N}\langle{\mathbf{z}},\mathbf{y}\rangle<-\lambda \end{array} \right. $$ But the proof of this is not clear to me. Can you give any hints?


This boils down to minimizing the function $$ f(x) =\frac12(x-y)^2+\lambda|x| $$ over $x\in\mathbb R$, where $y\in\mathbb R$ and $\lambda>0$. One can do this by inspection, but I would suggest using the concept of subdifferentials and looking for the point $x$ whose subdifferential contains $0$.

Denote the subdifferential of $f$ at $x$ by $\partial f(x)$. Then $$ \partial f(x)= \begin{cases} \{x-y-\lambda\}&\text{if}\ x<0;\\ [-y-\lambda,-y+\lambda]&\text{if}\ x=0;\\ \{x-y+\lambda\}&\text{if}\ x>0. \end{cases} $$ Hence, $$ x^\star= \begin{cases} y+\lambda&\text{if}\ y<-\lambda;\\ 0&\text{if}\ |y|\le\lambda;\\ y-\lambda&\text{if}\ y>\lambda. \end{cases} $$ Equivalently, $$ x^\star =\mathcal S_\lambda(y) =\operatorname{sign}(y)(|y|-\lambda)_+, $$ where $\mathcal S_\lambda:\mathbb R\to\mathbb R$ is the soft-thresholding operator with $\lambda\ge0$.
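As a quick numerical sanity check of this closed form, here is a small Python sketch (the function name `soft_threshold`, the grid, and the test values are my own choices) comparing $\mathcal S_\lambda$ against a brute-force grid minimization of $f$:

```python
import numpy as np

def soft_threshold(y, lam):
    """Soft-thresholding operator: S_lam(y) = sign(y) * max(|y| - lam, 0)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# Compare against a brute-force grid minimization of f(x) = (x - y)^2 / 2 + lam * |x|.
lam = 0.7
grid = np.linspace(-5.0, 5.0, 200001)  # spacing 5e-5
for y in (-2.3, -0.5, 0.0, 0.4, 1.9):
    f = 0.5 * (grid - y) ** 2 + lam * np.abs(grid)
    assert abs(grid[np.argmin(f)] - soft_threshold(y, lam)) < 1e-3
```

The test values cover all three branches of the case analysis: $y>\lambda$, $|y|\le\lambda$, and $y<-\lambda$.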

To see that the original problem reduces to this form, complete the square: \begin{align*} \frac1{2N}\sum_{i=1}^N(y_i-z_i\beta)^2+\lambda|\beta| &=\frac1{2N}\sum_{i=1}^N(y_i^2-2y_iz_i\beta+z_i^2\beta^2)+\lambda|\beta|\\ &=\frac12\Bigl[N^{-1}\sum_{i=1}^Ny_i^2-2N^{-1}\sum_{i=1}^Ny_iz_i\beta+N^{-1}\sum_{i=1}^Nz_i^2\beta^2\Bigr]+\lambda|\beta|\\ &=\frac12[N^{-1}\|y\|_2^2-N^{-2}(y^{\operatorname T}z)^2]+\frac12[\beta^2-2\beta N^{-1}y^{\operatorname T}z+N^{-2}(y^{\operatorname T}z)^2]+\lambda|\beta|\\ &=\frac12[N^{-1}\|y\|_2^2-N^{-2}(y^{\operatorname T}z)^2]+\frac12(\beta-N^{-1}y^{\operatorname T}z)^2+\lambda|\beta|, \end{align*} where we used the standardization $N^{-1}\sum_{i=1}^Nz_i^2=1$. Since the first bracketed term does not depend on $\beta$, it follows that $$ \hat\beta=\mathcal S_\lambda(N^{-1}y^{\operatorname T}z) $$ or, equivalently, $$ \hat\beta= \begin{cases} N^{-1}y^{\operatorname T}z+\lambda&\text{if}\ N^{-1}y^{\operatorname T}z<-\lambda;\\ 0&\text{if}\ |N^{-1}y^{\operatorname T}z|\le\lambda;\\ N^{-1}y^{\operatorname T}z-\lambda&\text{if}\ N^{-1}y^{\operatorname T}z>\lambda. \end{cases} $$
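The reduction can also be verified numerically; here is a sketch (the random data, seed, and grid parameters are my own choices, and the standardization $N^{-1}\sum_i z_i^2=1$ is enforced explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
N, lam = 50, 0.3

z = rng.normal(size=N)
z /= np.sqrt(np.mean(z ** 2))          # enforce (1/N) * sum z_i^2 = 1
y = 0.8 * z + rng.normal(size=N)

t = z @ y / N                          # (1/N) <z, y>
beta_hat = np.sign(t) * max(abs(t) - lam, 0.0)   # soft threshold

# Brute-force minimization of the original objective over a fine grid of beta.
grid = np.linspace(-3.0, 3.0, 60001)   # spacing 1e-4
obj = 0.5 / N * ((y[:, None] - z[:, None] * grid) ** 2).sum(axis=0) + lam * np.abs(grid)
assert abs(beta_hat - grid[np.argmin(obj)]) < 1e-3
```

The assertion checks that the closed-form $\mathcal S_\lambda(N^{-1}z^{\operatorname T}y)$ agrees with the grid minimizer of the original objective to within the grid resolution.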

I hope this helps.


The exercise in question (Exercise 2.2 of SLS) actually instructs you not to make use of subgradients, so here is such a proof. Assume the standardization $N^{-1}\sum_{i=1}^N z_i^2=1$, let $\hat{\beta}_o=\mathbf{z}^\intercal\mathbf{y}/N$ be the usual OLS estimate, and let $\ell$ denote the loss function in the minimization problem. We have

$$ \ell(b)-\ell(a)=\frac{b^{2}-a^{2}}{2}+\hat{\beta}_{o}(a-b)+\lambda(|b|-|a|). $$ If $|\hat{\beta}_{o}|\le\lambda$, then $\lambda|b|\ge|\hat{\beta}_{o}b|$ and hence $$ \ell(b)-\ell(0)=\frac{b^{2}}{2}-\hat{\beta}_{o}b+\lambda|b|\ge\frac{b^{2}}{2}-\hat{\beta}_{o}b+|\hat{\beta}_{o}b|\ge\frac{b^{2}}{2}\ge0, $$ so the minimum is achieved at $b=0$.

Otherwise, suppose $\hat{\beta}_{o}>\lambda$, so that

$$\ell(b)-\ell(\hat{\beta}_{o}-\lambda)=\frac{1}{2}\left(2\lambda|b|+b^{2}-2\hat{\beta}_{o}b+(\hat{\beta}_{o}-\lambda)^{2}\right). $$

If $b\le0$ then every term in the parentheses is non-negative (recall $\hat\beta_o>\lambda>0$), while if $b>0$ then $$ \ell(b)-\ell(\hat{\beta}_{o}-\lambda)= \frac{1}{2}(b-\hat{\beta}_{o}+\lambda)^{2}\ge0. $$ Thus the loss function is minimized at $b=\hat\beta_o-\lambda$. The case $\hat\beta_o<-\lambda$ follows by applying the above result to the predictor $-\mathbf{z}$, which flips the sign of $\hat\beta_o$.
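The algebraic identity behind the $b>0$ case can be machine-checked; here is a sketch with SymPy (my own setup: the constant $\|\mathbf y\|_2^2/2N$ is dropped since it cancels in differences, and the loss is written for $x\ge0$ so that $|x|=x$):

```python
import sympy as sp

b, bo, lam = sp.symbols('b beta_o lambda', positive=True)

# Loss for a standardized predictor, up to an additive constant, restricted
# to x >= 0 (so |x| = x):  l(x) = x^2/2 - beta_o*x + lam*x.
l = lambda x: x ** 2 / 2 - bo * x + lam * x

# For b > 0 and beta_o > lam:  l(b) - l(beta_o - lam) = (b - beta_o + lam)^2 / 2.
identity = l(b) - l(bo - lam) - (b - bo + lam) ** 2 / 2
assert sp.expand(identity) == 0
```

Expanding the difference to zero confirms that for $b>0$ the loss gap is exactly the square $\frac12(b-\hat\beta_o+\lambda)^2$, with equality (not just an inequality) at this step.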