What is the smooth approximation of group lasso penalty?


Given the group lasso penalty defined as $$g(W)= \lambda \sum_j \|W_j\|_2,$$ is it possible to derive a smooth approximation of this group penalty? I looked at the following paper by Nesterov, but it was not easy to understand:

Smooth minimization of non-smooth functions

I would like to solve this optimization problem very efficiently: \begin{equation} \hat{W}=\arg\min_W \frac{1}{2}\|Y-XW\|^2 + \lambda \|W\|_1 + \gamma \sum_{i=1}^{k} \|W_i\|_2 \end{equation} where $W \in \mathbb R^{M \times T}$, $W_i$ is the $i$th row of $W$, and $Y \in \mathbb R^{N \times T}$. I would like to rewrite the original problem as \begin{equation} \hat{W}=\arg\min_W g(W) + \lambda \|W\|_1 + h(W), \end{equation} where $g(W) =\frac{1}{2}\|Y-XW\|^2$ and $h(W)$ is a smooth approximation of $\gamma \sum_{i=1}^{k} \|W_i\|_2$. Then $F(W)=g(W)+h(W)$ is a differentiable function and I can easily compute its gradient. This would let me solve the original problem $$F(W) + \lambda \|W\|_1$$ using proximal gradient descent, since the prox of $\|W\|_1$ is known and can be computed very quickly.
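For reference, once such an $h$ is in hand, the proximal gradient step is straightforward to code. A minimal NumPy sketch (the function names `soft_threshold` and `prox_grad_step` are mine, and `grad_h` stands for the gradient of whatever smooth approximation you choose):

```python
import numpy as np

def soft_threshold(W, t):
    """Entrywise prox of t * ||W||_1: shrink each entry toward zero by t."""
    return np.sign(W) * np.maximum(np.abs(W) - t, 0.0)

def prox_grad_step(W, X, Y, grad_h, lam, step):
    """One proximal gradient step on F(W) + lam * ||W||_1, where
    F(W) = 0.5 * ||Y - X W||_F^2 + h(W) and grad_h(W) is the gradient
    of the smooth approximation h."""
    grad = X.T @ (X @ W - Y) + grad_h(W)   # gradient of the smooth part F
    return soft_threshold(W - step * grad, step * lam)
```

Iterating `prox_grad_step` with a step size below the reciprocal of the Lipschitz constant of $\nabla F$ gives plain proximal gradient descent; an accelerated (FISTA-style) variant only changes how the iterates are combined.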

There are 2 answers below.

Accepted answer

For a compact convex subset $K$ of a finite-dimensional Hilbert space $X$ and a bounded linear operator $A: Z \rightarrow X$, consider the composite function $f =\sigma_K \circ A$, where $\sigma_K$ is the support function of $K$, defined by $$ \sigma_K(x) := \sup_{y \in K}\langle x, y\rangle,\; \forall x \in X. $$

Now, Nesterov introduced a class of smooth approximations defined by

$$ f_\mu(x) := \sup_{y \in K}\langle Ax, y\rangle - \frac{\mu}{2}\|y\|_2^2, $$ for a smoothing parameter $\mu > 0$.

It's not too hard to see that

  • $f_\mu$ is convex; the inner maximand is $\mu$-strongly concave in $y$, so the supremum is attained at a unique maximizer.
  • $0 \le f - f_\mu \le \frac{\mu}{2}\max_{y \in K}\|y\|_2^2$, so $f_\mu \rightarrow f$ uniformly as $\mu \rightarrow 0^+$.
  • Using Danskin's Theorem, one can show that $f_\mu$ is smooth with $(\|A\|^2/\mu)$-Lipschitz gradient given by $\nabla f_\mu(x) = A^*\hat{y}_\mu(x)$, where $$\hat{y}_\mu(x) := \operatorname{argmax}_{y \in K}\,\langle Ax,y\rangle - \frac{\mu}{2}\|y\|_2^2 = \operatorname{Proj}_K\left(\frac{1}{\mu}Ax\right), $$ and $\operatorname{Proj}_K$ is the Euclidean projection operator onto $K$.

Thus if $K$ is "simple" (in the sense that $\text{Proj}_{K}$ can be easily applied) then you can imagine a very simple and intuitive algorithm for minimizing $f$: minimize $f_\mu$, reduce $\mu$, minimize $f_\mu$, reduce $\mu$, ... (until some desired tolerance is hit)
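As a quick sanity check of the formulas above, here is a minimal sketch of the simplest instance, assuming $A = I$ and $K = [-1,1]^n$, so that $f(x) = \|x\|_1$ and $f_\mu$ reduces to the familiar Huber function (the function name `f_mu` is mine):

```python
import numpy as np

def f_mu(x, mu):
    """Nesterov smoothing of f(x) = ||x||_1, with A = I and K = [-1, 1]^n.
    The inner maximizer is y_hat = Proj_K(x / mu) = clip(x / mu, -1, 1)."""
    y = np.clip(x / mu, -1.0, 1.0)                    # Proj_K(Ax / mu)
    value = np.dot(x, y) - 0.5 * mu * np.dot(y, y)    # smoothed objective
    return value, y                                   # y is the gradient of f_mu

# Coordinate-wise this is the Huber function:
#   |x_i| - mu/2   if |x_i| >= mu,   x_i^2 / (2 mu)   otherwise,
# so f_mu agrees with f up to mu/2 per coordinate, as the O(mu) bound predicts.
```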

This technique is called Nesterov's smoothing method for functions of "max-type", and can be used in areas as diverse as game theory (solving for Nash equilibria for incomplete information sequential zero-sum games) and image processing (total-variation minimization, etc.)

Returning to the original question: Group Lasso

In the case of the group Lasso, you can take $X = \mathbb R^{p \times g}$, $A = \text{ identity}$, and $K := \{z \in X \mid \|z\|_{2,\infty} \le 1\}$, the unit ball of the $\ell_{2,\infty}$ mixed norm (each group block of $z$ has $\ell_2$-norm at most $1$). This is the dual ball of the $\ell_{2,1}$ mixed norm, so $\sigma_K(W) = \|W\|_{2,1} = \sum_i \|W_i\|_2$, and $\text{Proj}_K$ simply rescales each group block into the unit $\ell_2$-ball.
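Here is a minimal NumPy sketch of this smoothing for the row-wise group penalty, assuming $A$ is the identity and taking $K$ to be the set of matrices whose rows have $\ell_2$-norm at most $1$ (so that $\sigma_K(W) = \sum_i \|W_i\|_2$); the function name is mine:

```python
import numpy as np

def smoothed_group_penalty(W, mu):
    """Nesterov smoothing of sum_i ||W_i||_2 with A = I and
    K = {Z : ||Z_i||_2 <= 1 for every row i}.
    Proj_K rescales each row of W / mu back into the unit l2-ball."""
    Z = W / mu
    row_norms = np.linalg.norm(Z, axis=1, keepdims=True)
    Y = Z / np.maximum(row_norms, 1.0)            # row-wise projection onto K
    value = np.sum(W * Y) - 0.5 * mu * np.sum(Y * Y)
    return value, Y                               # Y is the gradient of the smoothed penalty
```

Each row norm $\|W_i\|_2$ is replaced by its Huber-smoothed value: $\|W_i\|_2 - \mu/2$ when $\|W_i\|_2 \ge \mu$, and $\|W_i\|_2^2/(2\mu)$ otherwise.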

I hope this helps!

Second answer

Any convex (and lower semicontinuous) function $f$ has a smooth convex approximation called the Moreau-Yosida regularization of $f$ (with parameter $\mu > 0$), defined by \begin{align} f^{(\mu)}(x) = \min_u \, f(u) + \frac{1}{2\mu} \| u - x \|_2^2. \end{align} (This looks like the definition of the prox-operator of $f$, but with "min" instead of "argmin".)

Note that $$ f^{(\mu)}(x) = f(\operatorname{prox}_{\mu f}(x)) + \frac{1}{2\mu} \| \operatorname{prox}_{\mu f}(x) - x \|_2^2. $$ This formula allows us to evaluate $f^{(\mu)}$ by evaluating the prox-operator of $f$. The gradient of $f^{(\mu)}$ can also be expressed in terms of the prox-operator: $\nabla f^{(\mu)}(x) = \frac{1}{\mu}\left(x - \operatorname{prox}_{\mu f}(x)\right)$.
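For the row-wise group penalty $f(W) = \sum_i \|W_i\|_2$, the prox is row-wise block soft-thresholding, so the envelope and its gradient are cheap to compute. A minimal NumPy sketch (function names are mine):

```python
import numpy as np

def prox_group(W, t):
    """Prox of t * sum_i ||W_i||_2: row-wise block soft-thresholding."""
    row_norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(1.0 - t / np.maximum(row_norms, 1e-12), 0.0)
    return scale * W

def moreau_envelope(W, mu):
    """Moreau-Yosida regularization f^(mu) of f(W) = sum_i ||W_i||_2,
    evaluated via the prox formula; its gradient is (W - P) / mu."""
    P = prox_group(W, mu)
    f_P = np.linalg.norm(P, axis=1).sum()
    value = f_P + np.sum((P - W) ** 2) / (2 * mu)
    return value, (W - P) / mu
```

For this particular penalty (with Euclidean geometry and $A$ the identity), the Moreau envelope coincides with the Nesterov smoothing from the other answer: each row norm is replaced by a Huber function.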

However, the group sparsity penalty $g$ will no longer promote group sparsity as effectively once you replace it with a smooth approximation, so you'd have to think carefully about how the smoothed version is going to be used.