I have an optimization problem of the form
$$\begin{align*}\text{minimize }\;&f(x)\\ \text{subject to }\;&||x||_0 \le t,\end{align*}$$
where $t$ is a given constant, $f:\mathbb{R}^d \to \mathbb{R}$ is a convex, differentiable function, and $x$ ranges over $\mathbb{R}^d$. Or, roughly equivalently, I'd like to solve
$$\text{minimize }\; f(x) + \lambda \cdot ||x||_0.$$
What techniques are available for this?
Techniques I've found in my reading so far:
$L_1$ norm optimization: Replace the $L_0$ norm with an $L_1$ norm, and then solve the resulting (now convex) problem. In other words, solve
$$\text{minimize }\; f(x) + \lambda \cdot ||x||_1.$$
Optionally, follow this with iterative re-weighting: given a candidate solution $\hat{x}$, solve
$$\text{minimize }\; f(x) + \lambda \sum_{i=1}^d \frac{1}{|\hat{x}_i|} |x_i|,$$
and then replace $\hat{x}$ with the resulting solution; iterate until convergence. Apparently, the $L_1$ norm encourages sparse solutions. (I guess this is the idea behind Lasso and compressed sensing?)
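As a concrete sketch of that re-weighting loop, here is a minimal implementation assuming $f$ is a least-squares loss $\frac{1}{2}||Ax-b||^2$ (an assumption on my part; the question allows any smooth convex $f$). The inner weighted-$L_1$ solves use plain proximal gradient descent (ISTA), and the weights include the usual small damping constant `eps` to avoid dividing by zero; all function names here are illustrative.

```python
import numpy as np

def soft_threshold(v, tau):
    # Prox of tau * ||.||_1: shrink each entry toward zero by tau.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def weighted_lasso_ista(A, b, lam, w, n_iters=500):
    # Proximal gradient (ISTA) for 0.5*||Ax - b||^2 + lam * sum_i w_i * |x_i|.
    L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)
        x = soft_threshold(x - grad / L, lam * w / L)
    return x

def reweighted_l1(A, b, lam, n_rounds=5, eps=1e-3):
    # Iterative re-weighting: w_i = 1 / (|x_i| + eps); eps keeps weights finite.
    w = np.ones(A.shape[1])
    x = np.zeros(A.shape[1])
    for _ in range(n_rounds):
        x = weighted_lasso_ista(A, b, lam, w)
        w = 1.0 / (np.abs(x) + eps)
    return x
```

Each re-weighting round penalizes small coefficients harder, so the iterates tend toward sparser supports than a single Lasso solve with the same $\lambda$.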
Approximate the $L_0$ norm: Replace the $L_0$ norm with the $L_{1/2}$ norm, or more generally with an $L_p$ norm for $p \in (0,1)$, or with a smooth surrogate $g(x)$ such as
$$g(x) = \sum_{i=1}^d \log(1 + |x_i|/\alpha).$$
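One way to see why surrogates like this behave like the re-weighting above (a side observation, in case it helps): $\log(1+|x_i|/\alpha)$ is concave in $|x_i|$, so it lies below its tangent at any current iterate $\hat{x}$:
$$\log(1+|x_i|/\alpha) \le \log(1+|\hat{x}_i|/\alpha) + \frac{|x_i|-|\hat{x}_i|}{\alpha+|\hat{x}_i|}.$$
Minimizing this upper bound (majorization-minimization) is a weighted $L_1$ problem with weights $1/(\alpha+|\hat{x}_i|)$, i.e. essentially the iterative re-weighting scheme, with $\alpha$ acting as a damping constant that keeps the weights finite at zero.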
Also, here are two other techniques that I haven't seen described anywhere, but that seem like natural algorithms:
Forward greedy selection. Start by finding a one-element set $S_1$ that makes the following as small as possible:
$$\begin{align*}\text{minimize }\;&f(x)\\ \text{subject to }\;&x_j = 0 \text{ for all } j \notin S_1;\end{align*}$$
Then find a two-element set $S_2$ such that $S_1 \subset S_2$ and that makes the following as small as possible:
$$\begin{align*}\text{minimize }\;&f(x)\\ \text{subject to }\;&x_j = 0 \text{ for all } j \notin S_2;\end{align*}$$
Iterate, adding one index to the set at each stage. This minimization step can be solved efficiently by trying all possibilities for the one index you add to $S$ in each step.
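A sketch of this forward pass, again under the assumption that $f(x)=\frac{1}{2}||Ax-b||^2$, so that the restricted minimization is just a least-squares solve on the chosen columns (the function names are made up for illustration):

```python
import numpy as np

def lstsq_on_support(A, b, support):
    # Minimize 0.5*||Ax - b||^2 subject to x_j = 0 for all j outside `support`.
    x = np.zeros(A.shape[1])
    cols = sorted(support)
    if cols:
        x[cols] = np.linalg.lstsq(A[:, cols], b, rcond=None)[0]
    return x

def forward_greedy(A, b, t):
    # Grow the support one index at a time, greedily picking the index
    # whose addition yields the smallest restricted objective.
    S = set()
    for _ in range(t):
        def score(j):
            xj = lstsq_on_support(A, b, S | {j})
            return 0.5 * np.sum((A @ xj - b) ** 2)
        best_j = min((j for j in range(A.shape[1]) if j not in S), key=score)
        S.add(best_j)
    return lstsq_on_support(A, b, S), S
```

Each stage costs $O(d)$ restricted solves, so $t$ stages cost $O(td)$ solves in total.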
Backward greedy selection. Similar to above, but you start with $S_0=\{1,2,\dots,d\}$. In the $i$th iteration you look for $S_i$ such that $S_i \subset S_{i-1}$ and $|S_i|=|S_{i-1}|-1$ and that makes the following as small as possible:
$$\begin{align*}\text{minimize }\;&f(x)\\ \text{subject to }\;&x_j = 0 \text{ for all } j \notin S_i;\end{align*}$$
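The backward variant under the same least-squares assumption (a sketch with illustrative names; note it is more expensive up front, since it starts from the full support):

```python
import numpy as np

def lstsq_on_support(A, b, support):
    # Minimize 0.5*||Ax - b||^2 subject to x_j = 0 for all j outside `support`.
    x = np.zeros(A.shape[1])
    cols = sorted(support)
    if cols:
        x[cols] = np.linalg.lstsq(A[:, cols], b, rcond=None)[0]
    return x

def backward_greedy(A, b, t):
    # Start from the full support and repeatedly drop the index whose
    # removal increases the restricted objective the least.
    S = set(range(A.shape[1]))
    while len(S) > t:
        def score(j):
            xj = lstsq_on_support(A, b, S - {j})
            return 0.5 * np.sum((A @ xj - b) ** 2)
        worst_j = min(S, key=score)
        S.discard(worst_j)
    return lstsq_on_support(A, b, S), S
```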
Are there any other techniques worth knowing about? Are there any "dominance" relationships between these (e.g., method X usually beats method Y, so don't bother with method Y)?
I know that the $L_0$ norm $||\cdot||_0$ isn't convex and, in fact, isn't even a norm. I know the optimization problem I'm trying to solve is NP-hard in the worst case, so we can't expect efficient methods that always produce the optimal answer, but I'm interested in pragmatic heuristics that will often work well when $f$ is nice (smooth, etc.).
You should have a look at a nice package called Smoothed $ {L}_{0} $ (SL0).
It approximates the $ {L}_{0} $ "pseudo-norm" (which isn't convex) by a sum of Gaussian kernels.
The idea is to iterate on the parameter defining the kernel width, from wide to narrow (a warm start / continuation scheme).
It seems to work nicely and fits your problem.
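To be clear, SL0 itself is formulated for underdetermined linear systems $Ax=b$ (gradient steps on the smoothed objective followed by projection onto the constraint set). Below is a rough sketch of the same graduated-smoothing idea applied instead to the penalized form of the question; everything here (names, step-size rule, $\sigma$ schedule) is illustrative, not the package's API:

```python
import numpy as np

def graduated_gaussian_descent(grad_f, lip_f, x0, lam, sigmas, iters_per_sigma=200):
    # Gradient descent on the surrogate
    #     f(x) + lam * sum_i (1 - exp(-x_i**2 / (2*sigma**2))),
    # shrinking sigma across the outer loop (warm start / continuation).
    # Large sigma gives a smooth surrogate; as sigma -> 0 the penalty
    # approaches lam * ||x||_0.
    x = x0.copy()
    for sigma in sigmas:
        # Crude step size from a Lipschitz bound on the total gradient.
        step = 1.0 / (lip_f + lam / sigma**2)
        for _ in range(iters_per_sigma):
            pen_grad = lam * (x / sigma**2) * np.exp(-x**2 / (2 * sigma**2))
            x = x - step * (grad_f(x) + pen_grad)
    return x
```

Because the surrogate is non-convex once $\sigma$ is small, the result depends on the $\sigma$ schedule and the starting point; the warm start from large $\sigma$ is what makes the heuristic behave well in practice.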