Linear Least Squares with $L_2$ Norm Regularization / Penalty Term


I want to minimize the following cost function

$$J= \displaystyle\min_{\boldsymbol{\rm x}} \left\{\|A\boldsymbol{\rm x}-b\|_2^2 + \lambda\|\boldsymbol{\rm x}\|_2\right\}$$

It is a bit different from Tikhonov regularization because the penalty term is not squared. Unlike Tikhonov regularization, which has an analytic solution, I was not able to find a closed-form solution for $J$. Could someone give me a hint as to which method I can use to solve for the minimizer $\boldsymbol{\rm x}_\lambda$?

Best answer:

You can solve this problem easily using the proximal gradient method or an accelerated proximal gradient method such as FISTA. This is a powerful tool that everyone can easily learn to use. The proximal gradient method solves optimization problems of the form $$ \tag{$\spadesuit$} \text{minimize} \quad f(x) + g(x) $$ where $f$ and $g$ are convex, $f$ is differentiable (with a Lipschitz continuous gradient), and $g$ is "simple" in the sense that its proximal operator can be evaluated efficiently.

The proximal operator of $g$ is defined as follows: $$ \text{prox}_{tg}(\hat x) = \arg \min_x \; g(x) + \frac{1}{2t} \| x - \hat x \|_2^2. $$ The proximal operator of $g$ reduces the value of $g$ without straying too far from $\hat x$. The parameter $t > 0$ can be thought of as a "step size" that controls how far away from $\hat x$ we are allowed to move. Note that $g$ is not required to be differentiable. (But $g$ is required to be lower semicontinuous, which is a mild condition that is usually satisfied in practice.)
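As a quick sanity check of the definition, take the simple case $g(x) = \tfrac{1}{2}\|x\|_2^2$. Setting the gradient of the objective inside the prox to zero gives $$ \text{prox}_{tg}(\hat x) = \frac{1}{1+t}\,\hat x, $$ i.e. a shrinkage of $\hat x$ toward the origin, with larger $t$ allowing a bigger move away from $\hat x$.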

The proximal gradient method iteration is $$ x^{k+1} = \text{prox}_{tg}(x^k - t \nabla f(x^k)). $$ What is happening here is that, starting from the current location $x^k$, we take a gradient descent step to reduce the value of $f$, then we apply the prox-operator of $g$ to reduce the value of $g$. The step size $t > 0$ is required to satisfy $t \leq 1/L$, where $L$ is a Lipschitz constant for $\nabla f$. (There is also a line search version of the proximal gradient method that might converge faster. In the line search version the step size $t$ is chosen adaptively at each iteration.)
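As a rough sketch of the iteration in Python/NumPy (the callables `grad_f` and `prox_g` are illustrative names you would supply, not anything prescribed by the problem):

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, x0, t, n_iter=500):
    """Fixed-step proximal gradient method: x_{k+1} = prox_{tg}(x_k - t * grad_f(x_k)).

    grad_f : callable, grad_f(x) returns the gradient of f at x
    prox_g : callable, prox_g(v, t) returns prox_{tg}(v)
    x0     : starting point (NumPy array)
    t      : step size; should satisfy t <= 1/L, with L a Lipschitz constant of grad_f
    """
    x = x0.copy()
    for _ in range(n_iter):
        # Gradient step on f, followed by the prox step on g.
        x = prox_g(x - t * grad_f(x), t)
    return x
```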

Your optimization problem has the form $(\spadesuit)$ where $$ f(x) = \| Ax - b \|_2^2 $$ and $$ g(x) = \lambda \|x\|_2. $$ The gradient of $f$ can be evaluated using calculus: $$ \nabla f(x) = 2 A^T(Ax - b). $$ There is a nice formula for the proximal operator of the $\ell_2$-norm: $$ \text{prox}_{tg}(x) = \text{prox}_{t \lambda \| \cdot \|_2}(x) = x - P_B(x), $$ where $B$ is the $2$-norm ball of radius $t \lambda$ centered at the origin and $P_B(x)$ is the projection of $x$ onto $B$. (A formula for the proximal operator of a norm can be found for example in Vandenberghe's UCLA 236c notes. See chapter 8 "The proximal mapping", slide 8-3.)
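Continuing the sketch, the two ingredients above could be coded as follows (again NumPy; the function names are illustrative). The prox is exactly the "subtract the projection onto the ball of radius $t\lambda$" formula, which works out to block soft-thresholding:

```python
import numpy as np

def grad_f(x, A, b):
    """Gradient of f(x) = ||Ax - b||_2^2."""
    return 2.0 * A.T @ (A @ x - b)

def prox_l2_norm(x, t, lam):
    """prox_{t * lam * ||.||_2}(x) = x - P_B(x), with B the 2-norm ball of radius t*lam.

    If ||x||_2 <= t*lam, x lies inside B and the prox returns 0;
    otherwise x is shrunk toward the origin by t*lam (block soft-thresholding).
    """
    norm_x = np.linalg.norm(x)
    if norm_x <= t * lam:
        return np.zeros_like(x)
    return (1.0 - t * lam / norm_x) * x
```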

With these formulas, you are ready to implement the proximal gradient method for your optimization problem in just a few lines of code. There is also an accelerated version of the proximal gradient method, known as FISTA, which is equally simple and typically converges faster.
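Putting the pieces together, a minimal end-to-end run might look like the following (made-up random data, reusing the sketches above; the step size uses $L = 2\sigma_{\max}(A)^2$, a Lipschitz constant for $\nabla f$):

```python
import numpy as np

# Illustrative problem data.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)
lam = 1.0

# Step size t <= 1/L, where L = 2 * sigma_max(A)^2 is a Lipschitz constant of grad f.
L = 2.0 * np.linalg.norm(A, 2) ** 2
t = 1.0 / L

x = proximal_gradient(lambda x: grad_f(x, A, b),
                      lambda v, s: prox_l2_norm(v, s, lam),
                      x0=np.zeros(A.shape[1]), t=t, n_iter=2000)

# Value of the objective ||Ax - b||_2^2 + lam * ||x||_2 at the computed point.
print(np.linalg.norm(A @ x - b) ** 2 + lam * np.linalg.norm(x))
```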