I'm supposed to implement certain optimization algorithms (ISTA, FISTA) to minimize: $$\frac12 ||Ax-(Ax_0+z)||_2^2 + \lambda ||x||_1.$$
Here $A$ is a matrix, $x$ is a vector, and $z$ is a noise vector drawn from some distribution. I get that.
$\lambda$ is supposed to be chosen so as to "yield a sparse solution". What does that mean exactly? Isn't the L1 norm itself supposed to help find the sparsest solution? How do I pick my lambda to improve on that? What factors influence what the best $\lambda$ should be?
It's not hard to show (Hint: show that the dual problem is a projection of $\frac{1}{\lambda}(Ax_0 + z)$ onto the polyhedron $\mathcal P := \{\theta \text{ s.t. }\|A^T\theta\|_\infty \le 1\}$, and then use the KKT conditions ...) that if $\lambda \ge \lambda_{\text{max}} := \|A^T(Ax_0 + z)\|_\infty$, then the only solution of your problem is the zero vector. As you decrease $\lambda$ from $\lambda_{\text{max}}$ to $0^+$ (i.e. as you descend the Lasso path), roughly speaking, more and more indices enter the support of the solution (i.e. solutions become less sparse).
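To make the $\lambda_{\text{max}}$ claim concrete, here is a minimal ISTA sketch (the helper names `soft_threshold` and `ista` are my own, not from any library) checking that at $\lambda = \lambda_{\text{max}}$ the iterates stay at the zero vector:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: elementwise soft-thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, n_iters=500):
    # Minimize 0.5 * ||A x - b||_2^2 + lam * ||x||_1 by proximal gradient.
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)
        x = soft_threshold(x - grad / L, lam / L)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
# b plays the role of A x_0 + z: sparse signal plus small noise.
b = A @ np.where(rng.random(50) < 0.1, 1.0, 0.0) + 0.01 * rng.standard_normal(20)

lam_max = np.max(np.abs(A.T @ b))          # = ||A^T b||_inf
x_hat = ista(A, b, lam_max)
print(np.allclose(x_hat, 0.0))             # the solution is exactly zero
```

Starting from $x = 0$, one ISTA step soft-thresholds $A^Tb/L$ at level $\lambda_{\text{max}}/L$, which kills every coordinate since $|A^Tb|_j \le \lambda_{\text{max}}$; so the iterate never leaves zero.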
You may want to check this document on the subject. Jump to section 3.
Now, which particular value of $\lambda$ is "good" depends on your application and what you're trying to do. This problem is known as model selection, and spans an entire sub-field in statistical learning.
A simple and principled way to go about it is via cross-validation: you form a finite grid $\mathcal G$ of $\lambda$ values, say $10^{-k}\lambda_{\text{max}}$ for $k = 0, 1, \ldots, N$ (with $N \sim 10$, say). This gives you $N+1$ models to compare. Then you use something like $K$-fold cross-validation (say, with $K = 8$) to measure the relative performance of these models, and select the best based on these scores.
When computing / fitting these models (we say you're computing a regularization path), you'd go from the largest to the smallest value in $\mathcal G$, and each time warm-start the fitting of the current model with the solution of the previous one.
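The grid-plus-warm-start recipe can be sketched as follows (again with a hypothetical `ista` solver, here taking an initial point `x0`; the tolerance `1e-10` for counting the support is an arbitrary choice):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, x0, n_iters=200):
    # Proximal gradient for 0.5 * ||A x - b||^2 + lam * ||x||_1, started at x0.
    L = np.linalg.norm(A, 2) ** 2
    x = x0.copy()
    for _ in range(n_iters):
        x = soft_threshold(x - A.T @ (A @ x - b) / L, lam / L)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 60))
b = rng.standard_normal(30)

lam_max = np.max(np.abs(A.T @ b))
grid = [lam_max * 10.0 ** (-k) for k in range(6)]   # largest to smallest

x = np.zeros(A.shape[1])
supports = []
for lam in grid:
    x = ista(A, b, lam, x0=x)    # warm start from the previous solution
    supports.append(int(np.count_nonzero(np.abs(x) > 1e-10)))

print(supports)   # support sizes, typically growing as lambda decreases
```

Warm-starting pays off because neighboring $\lambda$ values on the grid have nearby solutions, so each subsequent fit needs far fewer iterations than a cold start.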
Amongst the many ways of computing the regularization path, the LARS-lasso algorithm is particularly cool. Coordinate descent is also an awesome method. There are cutting-edge implementations of these methods (cross-validation included) out there. For example, check out scikit-learn or R's glmnet package.
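For instance, with scikit-learn the whole pipeline (path + cross-validation) is one call to `LassoCV`, which runs warm-started coordinate descent along a grid of its own. One caveat: sklearn's objective is $\frac{1}{2n}\|Ax - b\|_2^2 + \alpha\|x\|_1$, so its `alpha` corresponds to $\lambda / n$ in the notation above.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
n, p = 100, 200
A = rng.standard_normal((n, p))
x0 = np.zeros(p)
x0[:5] = 3.0                                 # sparse ground truth (5 nonzeros)
b = A @ x0 + 0.1 * rng.standard_normal(n)    # b = A x_0 + z

# 8-fold CV over an automatically built grid of 20 alpha values.
model = LassoCV(cv=8, n_alphas=20, fit_intercept=False).fit(A, b)

print(model.alpha_)                          # selected regularization level
print(np.count_nonzero(model.coef_))         # sparsity of the selected model
```

The selected model should recover the 5 planted coefficients (possibly with a few spurious small ones), illustrating what "choosing $\lambda$ to yield a sparse solution" buys you in practice.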
N.B.: $\|a\|_\infty := \max_{u \text{ s.t. }\|u\|_1 \le 1} a^Tu = \max_{j}|a_j| = \lim_{p \rightarrow \infty}\|a\|_p$.