The Proximal Operator and the Sub Differential of the $ {L}_{2} $ Norm of a Matrix

349 Views Asked by At

I have function \begin{equation} f(W)=\gamma ||W||_2 \end{equation} What is the prox operator of this?

1

There are 1 best solutions below

1
On BEST ANSWER

anything for a bounty ;-) (In fact, I was simply running out of time during the work week.)

From previous discussions we know that the author is actually referring to an elementwise 1-norm, not the induced 1-norm. That is, $\|W\|_1\triangleq \sum_{ij}|w_{ij}|$. This means that this function is separable across columns:

$$f(W) = \sum_{j=1}^T \left( \lambda \|w_j\|_1 + \gamma \|w_j\|_2 + \tfrac{1}{2} \|w_j - u_j\|_2^2 \right)$$ So we can focus our view to the vector function $$g(w) = \lambda \|w\|_1 + \gamma \|w\|_2 + \tfrac{1}{2} \|w - u\|_2^2$$ The optimality conditions are: $$\lambda v_1 + \gamma v_2 + w = u, \quad v_1\in\partial \|w\|_1, \quad v_2\in\partial \|w\|_2$$ $$\partial \|w\|_1 = \{v\,|\, \|v\|_\infty \leq 1, ~ \langle v, w \rangle = \|w\|_1\}$$ $$\partial \|w\|_2 = \{v\,|\, \|v\|_2 \leq 1, ~ \langle v, w \rangle = \|w\|_2\}$$ To solve, let's define the standard soft-thresholding operator: $$\mathop{\textrm{soft}}(x;\lambda)= \begin{cases} x - \lambda & x > \lambda \\ 0 & |x| \leq \lambda \\ x + \lambda & x < -\lambda \end{cases}$$ and extend it to apply elementwise to vectors. Then we choose $$[v_1]_i = \mathop{\textrm{sign}}(u_i)\min\{|u_i|/\lambda,1\}, ~i=1,2,\dots m$$ $$\quad\Longrightarrow u - \lambda v_1 = \mathop{\textrm{soft}}(u;\lambda)$$ This reduces the optimality conditions to $$\gamma v_2 + w = \mathop{\textrm{soft}}(u;\lambda)$$ Let's call that right-hand term $q$ and consider three cases:

  • $u=0$. $q=0$ as well, so we can choose $v_1=v_2=w=0$.
  • $\|q\|_2 \leq \gamma$. We can choose $v_2=\gamma^{-1} q$, so $w=0$.
  • $\|q\|_2 > \gamma$. We can choose $v_2=q/\|q\|_2$, so $w=(1-\gamma/\|q\|_2) q$.

If you're familiar with the proximal operator for $\ell_2$ alone, you should recognize that we're doing the exact same operation here. Let's call it "shrink": $$\mathop{\textrm{shrink}}(q; \gamma) = \begin{cases} 0 & \|q\|_2 \leq \gamma \\ (1-\gamma/\|q\|_2\}) q & \|q\|_2 > \gamma \end{cases}$$ Therefore, we have $$\mathop{\textrm{arg min}} g(w) = \mathop{\textrm{shrink}}(\mathop{\textrm{soft}}(u;\lambda); \gamma).$$ That's right: to compute the prox for $\ell_1$ plus $\ell_2$, we simply apply the $\ell_1$ prox first, then the $\ell_2$!

So for the original problem, $$\mathop{\textrm{arg min}} F(W) = \bar{W} = \begin{bmatrix} \bar{w}_1 & \dots & \bar{w}_T \end{bmatrix}, \quad \bar{w}_j = \mathop{\textrm{shrink}}(\mathop{\textrm{soft}}(u_j;\lambda); \gamma).$$