Obtaining unique solution to under-determined least-squares problem


Given an under-determined problem ($n<p$), I know lasso and Tikhonov regularization are a couple of approaches that convert the least-squares minimization into a convex problem. While I understand mathematically why this is true, are we essentially solving the same problem by solving $\min_x \|Ax-b\|_2^2 +\lambda \|x\|^2_2$ instead of $\min_x \|Ax-b\|_2^2$? Also, isn't the choice of $\lambda$ arbitrary here?


Question 1:

The under-determined problem has infinitely many solutions; adding the penalty singles out one specific solution (and indeed makes the problem strictly convex).

To see this clearly, assume that $x^\star$ is such that $\|Ax^\star-b\|_2=\epsilon$ is the lowest error norm that can be achieved. Then $x^\star+\tilde x$, where $\tilde x\in \ker A$, clearly achieves the same error, but the norm of the solution changes to $\|x^\star+\tilde x\|$.
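A quick numerical sketch of this (with made-up random data, using numpy):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 3, 6                      # under-determined: fewer equations than unknowns
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

# One least-squares solution (lstsq returns the minimum-norm one)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

# Any vector in ker(A) can be added without changing the residual.
# The last p-n right singular vectors span the null space of A.
ns = np.linalg.svd(A)[2][n:].T
x_tilde = ns @ rng.standard_normal(p - n)

r1 = np.linalg.norm(A @ x_star - b)
r2 = np.linalg.norm(A @ (x_star + x_tilde) - b)
print(r1, r2)                    # identical residuals
print(np.linalg.norm(x_star), np.linalg.norm(x_star + x_tilde))  # different norms
```

Both candidates fit the data equally well, yet their norms differ, which is exactly the ambiguity the penalty resolves.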

So it is true that you are not solving the same problem, but the regularized problem can now be solved efficiently using standard algorithms.
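For instance, the Tikhonov/ridge problem has a closed-form solution through its normal equations, $(A^\top A + \lambda I)x = A^\top b$, where the matrix is positive definite for $\lambda > 0$ and hence the solution is unique. A minimal sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 3, 6
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)
lam = 0.1

# Ridge: minimize ||Ax - b||^2 + lam * ||x||^2.
# Normal equations: (A^T A + lam I) x = A^T b; positive definite for lam > 0,
# so one plain linear solve yields the unique minimizer.
x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ b)
print(np.linalg.norm(A @ x_ridge - b), np.linalg.norm(x_ridge))
```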

Question 2:

$\lambda$ is a parameter that sets the tradeoff between how strongly you think $x$ should be regularized and how closely it should stick to the data. If $\lambda$ is very large, you constrain $x$ a lot: its norm will be very small, but the data-fitting error $\|Ax-b\|$ may be very large. This is fine for very noisy data, but at some point you may lose information.
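You can watch this tradeoff directly by sweeping $\lambda$ (again on made-up random data): as $\lambda$ grows, $\|x\|$ shrinks and $\|Ax-b\|$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 10))
b = rng.standard_normal(5)

def ridge(lam):
    p = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ b)

sol_norms, residuals = [], []
for lam in [1e-3, 1e-1, 1e1, 1e3]:
    x = ridge(lam)
    sol_norms.append(np.linalg.norm(x))
    residuals.append(np.linalg.norm(A @ x - b))
    print(f"lam={lam:g}  ||x||={sol_norms[-1]:.3f}  ||Ax-b||={residuals[-1]:.3f}")
```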

An example application of regularized problems that may help build intuition is imaging. A standard problem is as follows:

  • you get a noisy image (which you can model as $Y=X+\epsilon$, where $Y,X,\epsilon$ are matrices: $Y$ is the image (pixel values) you observe, $X$ is the original image that you can't see, and $\epsilon_{ij}$ models the noise)
  • you want to recover $\hat X =f(Y)$

There are many $\hat X$ that lead to the exact same "data error" $\|\hat X-Y\|$, but some will "look better". To help the algorithm find those, you can consider a wide range of penalties, such as

  • an $\ell^2$ norm (will smooth the image)
  • a TV norm (will render the edges sharply)
  • ...

and, again, the hyperparameter $\lambda$ represents how noisy you believe the input $Y$ is and how strongly the "structure" should be enforced. A small $\lambda$ means a little structure and a lot of data fidelity (low-noise case); a large $\lambda$ means lots of structure and not much data fidelity (high-noise case).
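As a toy 1-D analogue of the $\ell^2$-smoothing case (a sketch, not a real imaging pipeline), one can denoise a signal by penalizing the differences between neighboring samples, i.e. minimize $\|x-y\|^2+\lambda\|Dx\|^2$ with $D$ the first-difference operator:

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0, 1, 200)
x_true = np.sin(2 * np.pi * t)                  # clean 1-D "image"
y = x_true + 0.3 * rng.standard_normal(t.size)  # noisy observation

# Penalize squared differences of neighboring samples:
# minimize ||x - y||^2 + lam * ||D x||^2, with D the first-difference matrix.
D = np.diff(np.eye(t.size), axis=0)
lam = 10.0
x_hat = np.linalg.solve(np.eye(t.size) + lam * D.T @ D, y)
print("roughness of y:    ", np.linalg.norm(D @ y))
print("roughness of x_hat:", np.linalg.norm(D @ x_hat))
```

The estimate is guaranteed to be smoother than the raw data: since $x=y$ is feasible, the minimizer must satisfy $\lambda\|D\hat x\|^2 \le \lambda\|Dy\|^2$.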

Selecting $\lambda$ is more an art than a science, but there are heuristics: some use a function of the estimated noise variance, some use a full Bayesian model for it (which essentially hides away the fact that they don't know how to set it), some use a grid search and pick the value that seems to give a good result, etc.
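The grid-search heuristic can be sketched as follows, on synthetic data with a held-out validation set (all names and data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 40, 60
A = rng.standard_normal((n, p))
x_true = np.zeros(p)
x_true[:5] = 1.0
b = A @ x_true + 0.1 * rng.standard_normal(n)

# Hold out some rows as a validation set and grid-search lambda.
A_tr, b_tr, A_va, b_va = A[:30], b[:30], A[30:], b[30:]

def ridge(A, b, lam):
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)

grid = np.logspace(-4, 2, 13)
errors = [np.linalg.norm(A_va @ ridge(A_tr, b_tr, lam) - b_va) for lam in grid]
best_lam = grid[int(np.argmin(errors))]
print(f"best lambda on the grid: {best_lam:g}")
```

In practice one would use cross-validation rather than a single split, but the idea is the same: pick the $\lambda$ that generalizes best to data the fit has not seen.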