I have just learned that a general framework in constrained optimization is called "proximal gradient optimization". It is interesting that the $\ell_0$ "norm" is also associated with a proximal operator. Hence, one can apply iterative hard thresholding algorithm to get the sparse solution of the following
$$\min \Vert Y-X\beta\Vert_F + \lambda \vert \beta \vert_0$$
If so, why people are still using $\ell_1$? If you can just get the result by non-convex optimization directly, why are people still using LASSO?
I want to know what's the downside of the proximal gradient approach for $\ell_0$ minimization. Is it because of the non-convexity and randomness associated with? That means the initial estimator is very important.
This following points are from optimizaiton perspective.
With $\ell_0$ norm, you can obtain closed form solution for proximal mapping. But, the original problem is still non-convex and convergence gaurantees of iterative hard thresholding exist upto to stationary points only (as far as I know). Moreover, with $\ell_0$ norm, I think it is quite easy to get stuck in bad local minima and the inertial algorithms do not behave well either. The algorithm can be very sensitive to initialization aswell.
With $\ell_1$ norm, you have the opportunity to reach global minimum. Moreover, proximal mapping is available in closed form solution (which is much better than plain subgradient descent). Accelerated variants such as FISTA can also be applied for further speedups.
It appears that the following reference addresses the issues regarding the $\ell_0$ norm: T. Blumensath - Iterative Hard Thresholding: Theory and Practice.