How does tensorflow make sure some of the regularized parameters will stick to 0s, when L1 regularization applied?

I am using a subset of the network parameters $\theta = \{\theta_i\}_i$ to perform feature selection on the input. The loss function thus reads $L(\theta) = L_0(\theta) + \lambda\|\theta\|_1$.

I found that once some $\theta_i$ reach values near $0$ (numerically, around $10^{-3}$), they stay around $0$ and never return to significant values.

I don't see any reason why numerically zero parameters should stop moving, because their gradient is $\partial_{\theta}L = \partial_{\theta}L_0 + \lambda\,\mathrm{sign}(\theta)$. Considering that (1) $\theta$ never equals $0$ exactly, so there is always a $+\lambda$ or $-\lambda$ term in the gradient, and (2) $\partial_{\theta}L_0$ is likely non-zero, the non-zero gradient should force numerically zero parameters to move at each iteration.
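To make this concrete, here is a minimal numerical sketch (hypothetical values for $\eta$ and $\lambda$, and the data-gradient $\partial_\theta L_0$ set to zero for simplicity). It shows that a non-zero gradient does not imply the parameter escapes zero: under plain gradient descent, the $\pm\lambda$ term flips sign each time $\theta$ crosses $0$, so $\theta$ does keep moving, but only within a band of width about $\eta\lambda$:

```python
# Plain gradient descent on L(theta) = L0(theta) + lam * |theta|,
# with a (hypothetical) constant data-gradient dL0 = 0.0 for simplicity.
eta, lam = 1e-3, 0.01   # assumed learning rate and L1 strength
theta = 5e-3            # start near zero
history = []
for _ in range(1000):
    grad = 0.0 + lam * (1.0 if theta > 0 else -1.0)  # dL0 + lam * sign(theta)
    theta -= eta * grad
    history.append(theta)

# After the initial decay, theta flips sign every step and stays
# within roughly eta * lam = 1e-5 of zero.
print(max(abs(t) for t in history[-100:]))  # ~1e-5
```

So "moving at each iteration" and "stuck around zero" are not contradictory: the oscillation amplitude is just far below anything visible.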

I am using tensorflow to implement the network, with AdamOptimizer.

Best Answer

I can't be certain without more details, but I can think of a few potential reasons.

  1. The $L_1$ norm penalty (Lasso) is a sparsifying penalty, meaning it pushes each entry of $\theta$ towards zero. One interpretation is that $L_1$ is the best convex relaxation of the $L_0$ penalty, which counts the number of non-zero parameters (see here). In comparison to classic $L_2$ weight decay, the $L_1$ penalty is in some sense "equally harsh" to both large and small values, whereas $L_2$ cares less and less as $|\theta_i|$ decreases (since small numbers squared become even smaller). See here and here for more information about the sparsification effects of $L_1$.
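     A one-line comparison of the two penalty gradients (with an assumed $\lambda$) makes the "equally harsh" point concrete: the $L_1$ gradient magnitude is $\lambda$ no matter how small $\theta$ is, while the $L_2$ gradient shrinks along with $\theta$:

     ```python
     lam = 0.01  # assumed regularization strength
     for theta in [1.0, 0.1, 0.001]:
         l1_grad = lam * (1.0 if theta > 0 else -1.0)  # d/dtheta of lam * |theta|
         l2_grad = 2 * lam * theta                     # d/dtheta of lam * theta^2
         print(f"theta={theta:6.3f}  |L1 grad|={abs(l1_grad):.5f}  |L2 grad|={abs(l2_grad):.5f}")
     ```

     Near zero, the constant $L_1$ pull keeps squeezing the parameter, which is exactly what produces sparsity.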

  2. There is actually a good reason for "the zero parameters to stop moving" that is more particular to neural networks. Firstly, note that $\partial_\phi \mathcal{L}$ for some specific scalar weight $\phi$ measures (as you likely know) the effect of infinitesimally perturbing $\phi$ on the loss $\mathcal{L}$. Many networks are filled with activation functions (also called pointwise non-linearities) that can destroy the gradients of the weights within particular regimes. For instance, consider a weight matrix $W$ in a simple neural layer $f_\ell(x)=\text{ReLU}(Wx)$, and suppose $Wx < -\delta^2$ for some non-zero $\delta$. Then $f_\ell(x) = 0$ and $\partial_W f_\ell(x) = 0$, because a minor perturbation within the zeroed domain of ReLU cannot move it out of that domain! This is one reason why other activations, like leaky ReLU, are sometimes used. In other words, once a neuron enters the "dead zone" of the activation function (across a decent area of input data space $x$), it may be difficult to leave it (i.e., become non-zero). This problem is very well known as the "Dead ReLU Problem" (see here or here, for example). Of course, for people who want a sparse network (e.g., for computational efficiency or compression), this may be useful.
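     A small NumPy sketch of this argument (with made-up $W$ and $x$): when every pre-activation in $Wx$ is negative, the hand-computed gradient of the layer output with respect to $W$ is exactly zero, so no perturbation signal reaches the weights:

     ```python
     import numpy as np

     W = np.array([[-1.0, -2.0],
                   [-0.5, -1.5]])   # hypothetical weights
     x = np.array([1.0, 1.0])       # hypothetical input

     z = W @ x                      # pre-activations; all negative here
     out = np.maximum(z, 0.0)       # ReLU(Wx) = 0 for every unit
     # d out_i / d W_ij = relu'(z_i) * x_j, and relu'(z) = 0 for z < 0
     grad_W = (z > 0).astype(float)[:, None] * x[None, :]

     print(out)     # [0. 0.]
     print(grad_W)  # all zeros: small perturbations of W cannot change the output
     ```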

    OK, but what if you are not using ReLU? Other activations have this issue as well (worse, actually). The classical sigmoid and tanh non-linearities, for instance, have large swathes of their domain over which the derivative is very small. This means that, if a neuron gets trapped in those parts of the input space (i.e., mostly outputs values there), then it is nearly impossible for it to leave, because the derivatives of the non-linearity essentially destroy the back-propagated gradients there. This issue is the source of the classic "vanishing gradient problem" (see here, here, or here).
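    For a sense of scale, here is a quick check (sample points chosen arbitrarily) of how fast the sigmoid and tanh derivatives decay away from the origin:

    ```python
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    for x in [0.0, 5.0, 10.0]:
        ds = sigmoid(x) * (1.0 - sigmoid(x))  # sigmoid'(x)
        dt = 1.0 - np.tanh(x) ** 2            # tanh'(x)
        print(f"x={x:5.1f}  sigmoid'={ds:.2e}  tanh'={dt:.2e}")
    ```

    Already at $x = 10$ both derivatives are many orders of magnitude below their peak, so any gradient back-propagated through a saturated unit is multiplied by a near-zero factor.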

  3. Indeed, I suspect this is not entirely due to activations alone. In a network with many millions of parameters and a small learning rate, it is not surprising that a weight near zero would stay there: being so small, it simply won't have a large effect on the output. Something like Adam won't help; in fact, it might make things worse, because it tracks past gradient values (so small gradients will also shrink future updates). These small values are then pushed towards zero by the $L_1$ penalty. I suspect that your learning rate $\eta$ (usually $10^{-3}$ or smaller), multiplied by the small effect of a small parameter (potentially also dampened by the activation function), altogether creates a weight update $\Delta_\phi = \eta\nabla_\phi\mathcal{L}$ that is quite small. In other words, once a weight shrinks to something small, your network finds it easy to just ignore it and use the other parameters instead.
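     To see the scale involved, here is a toy hand-rolled Adam update (standard default hyperparameters, assumed $\eta$ and $\lambda$) driven by the $L_1$ term alone. Because Adam normalizes each step by the running gradient magnitude, the per-step update saturates near the learning rate $\eta$, so once the weight reaches zero it rattles around in a band of a few $\times\,\eta$ — which is consistent with weights appearing to stick around $10^{-3}$:

     ```python
     # Toy Adam (hand-rolled, standard defaults) minimizing lam * |theta| alone.
     eta, lam = 1e-3, 0.01
     beta1, beta2, eps = 0.9, 0.999, 1e-8
     theta, m, v = 0.5, 0.0, 0.0
     history = []
     for t in range(1, 5001):
         g = lam * (1.0 if theta >= 0 else -1.0)   # subgradient of lam * |theta|
         m = beta1 * m + (1 - beta1) * g           # first-moment EMA
         v = beta2 * v + (1 - beta2) * g * g       # second-moment EMA
         m_hat = m / (1 - beta1 ** t)              # bias corrections
         v_hat = v / (1 - beta2 ** t)
         theta -= eta * m_hat / (v_hat ** 0.5 + eps)
         history.append(theta)

     # theta marches down by ~eta per step, then oscillates around zero
     # within a band of a few * eta (i.e., ~1e-3), never escaping.
     print(max(abs(t) for t in history[-500:]))
     ```

     This is only a caricature (the real $\partial_\phi L_0$ is not zero), but it suggests why the sticking point you observe is numerically on the order of your learning rate.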