I am working through the paper "Predicting the Long-term Stock Market Volatility: A GARCH-MIDAS Model with Variable Selection".
It uses a penalized log-likelihood, PLLH = LLH - L1-norm penalty (equation 7),
and always speaks of estimates being "non-zero" or "shrunken to zero":
We choose the optimal tuning parameter using Generalized Information Criteria (GIC), and select the variables with non zero parameter estimates
I coded it in R for a simulation study, so I know the true variables. I also tried a proximal gradient descent variant, but it struggles to converge and takes much longer. So now I am using optim(method="BFGS"), which works quite well, but it never returns exactly zero estimates: for the non-active variables the estimates are in the range of 1e-7 to 1e-13.
The gradient function is a numerical approximation of the gradient of the PLLH.
Can I even achieve exact zero estimates without a threshold, or did I do something wrong?
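The BFGS method is a sophisticated (quasi-Newton) descent method, but a descent method nonetheless. The partial derivative of the $1$-norm with respect to a component is equal to "sign(component)" wherever that component is non-zero, and the norm is not differentiable where the component is exactly $0$. If a component of the current solution is destined to be non-active, it will oscillate around $0$: the penalty pushes it back toward $0$ from whichever side it is on, but the iterates do not land on $0$ exactly.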
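To make the oscillation concrete, here is a minimal Python sketch on a one-dimensional toy problem. It is an illustration under stated assumptions, not the GARCH-MIDAS likelihood and not BFGS itself. The objective 0.5 * (b - z)^2 + lam * |b|, the values z = 0.5, lam = 1.0, step = 0.25, and all function names are hypothetical choices. With |z| < lam the exact minimizer is b = 0. A fixed-step update using "sign(b)" as the penalty derivative keeps jumping across 0, whereas a proximal (soft-thresholding) update, the building block of the proximal gradient variant mentioned in the question, returns exactly 0.

```python
def sign(x):
    """Return 1, -1 or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def soft_threshold(v, t):
    """Proximal operator of t * |b|: shrink v toward 0, mapping [-t, t] to exactly 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def sign_gradient_path(z, lam, step, b0, n_iter):
    """Fixed-step descent that treats sign(b) as the derivative of |b|."""
    b, path = b0, []
    for _ in range(n_iter):
        grad = (b - z) + lam * sign(b)  # smooth part + penalty "derivative"
        b = b - step * grad
        path.append(b)
    return path


def proximal_path(z, lam, step, b0, n_iter):
    """Proximal gradient: gradient step on the smooth part, then soft-threshold."""
    b, path = b0, []
    for _ in range(n_iter):
        b = soft_threshold(b - step * (b - z), step * lam)
        path.append(b)
    return path


# Hypothetical toy values; the true minimizer is 0 because |z| < lam.
sign_path = sign_gradient_path(z=0.5, lam=1.0, step=0.25, b0=1.0, n_iter=20)
prox_path = proximal_path(z=0.5, lam=1.0, step=0.25, b0=1.0, n_iter=20)

print("sign-based iterates:", sign_path[:6])
print("proximal iterates:  ", prox_path[:6])
```

In this toy setting the sign-based iterates change sign repeatedly and never equal 0, while the proximal iterates reach 0 and stay there. A quasi-Newton method with a numerical gradient ends up much closer to 0 than this crude fixed-step scheme, but for the same reason it has no mechanism that produces an exact 0.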
To sum up, this is perfectly normal and, as you hinted, a threshold should be used afterward if one wants to set non-active components to 0.
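The thresholding step itself can be sketched in a few lines of Python. The function name apply_threshold, the cutoff 1e-6 and the example estimates are hypothetical; the cutoff is only chosen to sit above the 1e-7 to 1e-13 range reported in the question and should be adapted to the scale of the parameters.

```python
def apply_threshold(estimates, tol):
    """Set every estimate with absolute value below tol to exactly 0.0."""
    return [0.0 if abs(b) < tol else b for b in estimates]


# Hypothetical optimizer output: two active and two non-active coefficients.
raw_estimates = [0.8, 3e-7, -2e-13, -0.4]
selected = apply_threshold(raw_estimates, tol=1e-6)
print(selected)
```

The variables whose entries remain non-zero after this step are the ones treated as selected.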