I am trying to define an optimization problem: $$ \min_\theta \sum_{x\in X} \Big[ L(x, \theta, f_\theta(x),\delta_\theta(x)) + \lambda_s S(\delta_\theta(x)) \Big] $$ where $X$ is the dataset, $\theta$ are the parameters of the function being optimized ("learned"), $L$ is a loss function, and $S$ is some kind of differentiable sparsity-encouraging penalty, which is what I am trying to determine.
Essentially, my network has two outputs, $f_\theta(x)$ and $\delta_\theta(x)\in\mathbb{R}^n$, and I know that $\delta_\theta(x)$ should be quite sparse. Its sparsity level will depend on $x$. My question is how to encourage this in the cost function (within a machine learning context).
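For concreteness, here is roughly how I am structuring the combined loss (a minimal PyTorch sketch; `task_loss_fn`, `penalty_fn`, and `lambda_s` are placeholder names for my setup, not a fixed API):

```python
import torch

def l1_penalty(delta):
    # S(delta) = sum_i |delta_i|: the usual sparsifying relaxation of L0
    return delta.abs().sum(dim=-1)

def total_loss(f_x, delta_x, target, task_loss_fn, penalty_fn=l1_penalty, lambda_s=1e-2):
    # Per-sample objective L(x, theta, f(x), delta(x)) + lambda_s * S(delta(x)),
    # averaged over the batch. task_loss_fn stands in for the complicated,
    # non-convex but differentiable loss L; all names here are placeholders.
    per_sample = task_loss_fn(f_x, delta_x, target) + lambda_s * penalty_fn(delta_x)
    return per_sample.mean()
```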
A common sparsifying penalty is the $L_1$ norm (a relaxation of the $L_0$ norm), which is fine, but I don't really want to shrink or restrict the non-zero elements. (Related: [1], [2], [3]). The loss $L$ is already very complicated (non-convex, but differentiable), so my feeling is that the convexity of $L_1$ is not terribly important.
Another option (e.g. here) is the log penalty $$ S(\delta) = \sum_i \log(1 + |\delta_i|/\xi), $$ and there is also the smoothed $L_0$ penalty $$ S(\delta) = \sum_i \left(1 - e^{-\delta_i^2/(2\sigma^2)}\right) $$ for some small $\sigma$ (which can be iteratively tuned).
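Both candidates are easy to write as differentiable penalties. Here is a sketch of how I would implement them (PyTorch again; `xi` and `sigma` are the hyperparameters from the formulas above, and the default values are arbitrary):

```python
import torch

def log_penalty(delta, xi=1e-2):
    # S(delta) = sum_i log(1 + |delta_i| / xi)
    return torch.log1p(delta.abs() / xi).sum(dim=-1)

def smoothed_l0_penalty(delta, sigma=1e-1):
    # S(delta) = sum_i (1 - exp(-delta_i^2 / (2 sigma^2)));
    # approaches the L0 count as sigma -> 0 while staying differentiable
    return (1.0 - torch.exp(-delta.pow(2) / (2.0 * sigma**2))).sum(dim=-1)
```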
Questions:
Besides the need to tune $\sigma$ (and $\xi$), is there any reason to prefer $L_1$ over the two $S(\delta)$ functions above?
Why do I rarely (if ever) see the $L_p$ "norms" with $p<1$ used as penalties? E.g. $S_{1/2}(\delta) = \sum_i \sqrt{|\delta_i|}$?
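For question 2, the implementation I have in mind would look something like this (a sketch; the small `eps` is something I would add myself to keep the gradient finite at zero, which may itself be part of the answer):

```python
import torch

def lp_penalty(delta, p=0.5, eps=1e-8):
    # S_p(delta) = sum_i (|delta_i| + eps)^p with 0 < p < 1;
    # eps keeps the gradient of |t|^p finite at t = 0
    return (delta.abs() + eps).pow(p).sum(dim=-1)
```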