How to derive BIC for a regularized linear regression model to select the optimal hyperparameters?


I have a regression model with a sparse group lasso penalty with the following loss function:

$$||Y-X\beta||_2^2 + \lambda||\beta||_1 + \gamma \sum_{g=1}^{G} ||\beta_g||_2,$$ where $\beta_g$ is the subvector of $\beta$ belonging to group $g$. I have implemented it in my package. Currently, for model selection, I do cross-validation over a grid of the parameters $\lambda$ and $\gamma$.

Since this gets computationally very expensive for large-scale problems, I would like to use BIC or EBIC for model selection and for estimating $\lambda$ and $\gamma$. Could someone help me derive it for this loss function and show how to apply it in practice?


EDIT: There is an old answer in the edit history where I did the same thing for the Lasso problem because I misread the question.

This paper goes into a prior expression for the group Lasso. For the regular Lasso, Laplace priors are placed on the regression coefficients and the MAP estimator recovers the Lasso solution. For the group Lasso problem, clusters of regression coefficients are shrunk in sync, and the prior formulation also uses a shrinkage prior to shrink multiple shared coefficients down to 0.

If the group Lasso problem is written down across $G$ different groups using the loss function:

\begin{equation} ||Y - X\beta||_2^2 + \lambda \sum_{k=1}^p |\beta_k|_1 + \gamma \sum_{g=1}^G||\beta_g||_2 \end{equation}

Then this solution can be recovered by the MAP estimator of:

\begin{equation} \begin{split} Y_i | X_i &\sim \textrm{N}(Y_i | X_i\beta, \sigma^2)\\ d\beta & \propto \exp{(-\lambda \sum_{k=1}^p |\beta_k|_1 - \gamma \sum_{g=1}^G||\beta_g||_2)} \end{split} \end{equation}

You can see this by finding the MAP estimator of the previous model which can be done by taking the negative log of the unnormalized posterior distribution:

\begin{equation} \begin{split} \pi(\beta,\lambda,\gamma,\sigma^2 \mid X,Y) & \propto \exp{\left(-\frac{||Y - X\beta||_2^2}{2\sigma^2}\right)} \exp{\left(-\lambda \sum_{k=1}^p |\beta_k|_1 - \gamma \sum_{g=1}^G||\beta_g||_2\right)}\\ \Rightarrow \min_\beta\, -\log\pi(\beta,\lambda,\gamma,\sigma^2 \mid X,Y) & = \min_\beta\, \frac{||Y - X\beta||_2^2}{2\sigma^2} + \lambda \sum_{k=1}^p |\beta_k|_1 + \gamma \sum_{g=1}^G||\beta_g||_2 \end{split} \end{equation}

To get the BIC you need to find the maximum of the posterior, i.e. the solution to the previous minimization problem, and plug it into $\log(n)k - 2\log\pi(\beta,\lambda,\gamma,\sigma^2 \mid X,Y)$, where $k$ is the number of parameters estimated; the term $\log(n)k$ penalizes the model for over-parameterization.

How you'd optimize this is a little outside my scope, but you could try a grid of $\lambda$ and $\gamma$ values, fitting the group Lasso with the usual algorithm at each grid point. Taking the grid point with the minimum BIC would give you an estimate of the best $\lambda$ and $\gamma$ values.
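As a concrete sketch of that grid search (my own minimal illustration, not part of the original answer): proximal gradient descent is one standard way to fit the sparse group Lasso, using the known result that the prox of the combined penalty is elementwise soft-thresholding followed by group-wise shrinkage (Simon et al.). The BIC here substitutes the number of nonzero coefficients as a degrees-of-freedom proxy, which is itself a common approximation rather than the exact posterior evaluation discussed below.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding: prox of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sgl_prox(z, groups, lam, gam):
    """Prox of lam*||.||_1 + gam*sum_g ||.||_2: soft-threshold,
    then shrink each group's subvector toward zero.
    `groups` is assumed to partition the coefficient indices."""
    z = soft_threshold(z, lam)
    out = np.zeros_like(z)
    for g in groups:
        norm = np.linalg.norm(z[g])
        if norm > gam:
            out[g] = (1.0 - gam / norm) * z[g]
    return out

def fit_sparse_group_lasso(X, y, groups, lam, gam, n_iter=500):
    """Proximal gradient descent (ISTA) on
    0.5*||y - X beta||^2 + lam*||beta||_1 + gam*sum_g ||beta_g||_2."""
    p = X.shape[1]
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        beta = sgl_prox(beta - grad / L, groups, lam / L, gam / L)
    return beta

def bic(X, y, beta):
    """Gaussian BIC with the nonzero count as a degrees-of-freedom proxy."""
    n = X.shape[0]
    rss = np.sum((y - X @ beta) ** 2)
    k = np.count_nonzero(beta)
    return n * np.log(rss / n) + k * np.log(n)

def select_by_bic(X, y, groups, lams, gams):
    """Fit on every (lambda, gamma) pair and keep the BIC minimizer."""
    return min(
        ((lam, gam) for lam in lams for gam in gams),
        key=lambda pg: bic(X, y, fit_sparse_group_lasso(X, y, groups, *pg)),
    )
```

All function names here are hypothetical, and the BIC variant shown is one of several in use (EBIC would add a term in the number of candidate models).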

The important part of the probability model formulation as opposed to the loss function representation is that you evaluate the negative log posterior at the maximum. This is more difficult than expressing the problem because you would need to know the normalizing constant. You could probably find a number of ways to approximate this, but I don't know if anyone has solved that problem.

(Ways you might be able to evaluate the normalizing constant would be using sampling approaches, variational approximations, Laplace approximations, etc. It would probably be pretty hard to evaluate the accuracy of the non-sampling based approximation methods.)

If someone has more experience with the optimization part and the evaluation of the normalizing constant they could probably give a more thorough answer.

According to the Wikipedia page BIC struggles with variable selection problems. I know this is true in some Bayesian variable selection models, but I think it's probably unstudied whether the Group Lasso would work with BIC. It's probably worth looking into/checking for yourself if you're interested anyway.