LASSO relationship between Lagrange multiplier and constraint and why it doesn't matter


My understanding of LASSO regression is that the regression coefficients are selected to solve the minimisation problem:

$$\min_\beta \|y - X \beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \leq t$$

In practice this is done using a Lagrange multiplier, so the problem actually solved is

$$\min_\beta \|y - X \beta\|_2^2 + \lambda \|\beta\|_1 $$

What is the relationship between $\lambda$ and $t$? Wikipedia unhelpfully states only that it is "data dependent".

Why do I care? Firstly, out of intellectual curiosity: having searched this site and others, I can't find any explicit discussion of this link.

Beyond being curious, I am concerned about the consequences for selecting $\lambda$ by cross-validation.

Specifically, if I'm doing $n$-fold cross-validation, I fit $n$ different models to $n$ different partitions of my training data. I then compare the accuracy of each model on the held-out data for a given $\lambda$. But the same $\lambda$ implies a different constraint ($t$) for different subsets of the data (i.e., $t = f(\lambda)$ is "data dependent").

Isn't the cross validation problem I really want to solve to find the $t$ that gives the best bias-accuracy trade-off?

I can get a rough idea of the size of this effect in practice by calculating $\|\beta\|_1$ for each cross-validation split and each $\lambda$, and looking at the resulting distribution. In some cases the implied constraint ($t$) can vary quite substantially across my cross-validation subsets, where by "substantially" I mean the coefficient of variation of $t$ is $\gg 0$.
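
For concreteness, here is a minimal sketch of that diagnostic on synthetic data, using scikit-learn's `Lasso` (whose `alpha` parameter plays the role of $\lambda$, up to sklearn's $\frac{1}{2n}$ scaling of the squared-error term):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

# Synthetic data; any (X, y) would do here.
X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)

alpha = 0.5  # fixed penalty level (sklearn's analogue of lambda)
t_values = []
for train_idx, _ in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    fit = Lasso(alpha=alpha).fit(X[train_idx], y[train_idx])
    t_values.append(np.abs(fit.coef_).sum())  # implied constraint t = ||beta||_1

t_values = np.array(t_values)
print(t_values)
print("coefficient of variation of t:", t_values.std() / t_values.mean())
```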

So why is it OK to find an optimal $\lambda$, where the amount of regularisation implied by $\lambda$ depends on how I slice the data, instead of finding explicitly the optimal amount of regularisation (i.e., $t$)?

1 Answer

The relationship between $\lambda$ and $t$ is given via the KKT conditions; specifically, that of complementary slackness, which states that

$$\lambda(\|\widehat{\beta}\|_1 - t) = 0,$$

where $\widehat{\beta}$ is the solution to the LASSO problem (and is itself a function of $\lambda$). Thus, for nonzero $\lambda$ (i.e. whenever the LASSO solution is not simply the OLS estimator), the constraint is active and $\|\widehat{\beta}\|_1 = t$: this is the connection between $\lambda$ and $t$.
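
For completeness, here is the full KKT system this comes from (a standard convex-optimisation sketch, with $\partial\|\cdot\|_1$ denoting the subdifferential, needed because the $\ell_1$ norm is not differentiable at zero):

$$0 \in -2X^\top(y - X\widehat{\beta}) + \lambda\,\partial\|\widehat{\beta}\|_1, \qquad \|\widehat{\beta}\|_1 \leq t, \qquad \lambda \geq 0, \qquad \lambda\left(\|\widehat{\beta}\|_1 - t\right) = 0.$$

The first (stationarity) condition is exactly the optimality condition of the penalised form, which is why the two formulations share solutions: a pair $(\widehat{\beta}, \lambda)$ satisfying this system solves the penalised problem at that $\lambda$ and the constrained problem at $t = \|\widehat{\beta}\|_1$.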

Unfortunately, this is likely the most that can be said in general, as the LASSO estimator does not have a closed-form solution. However, the connection can be made more explicit outside the high-dimensional setting, i.e. when $n \geq p$.

In this setting, suppose that $X$ has orthonormal columns, $X^\top X = I$ (otherwise use the QR decomposition $X = QR$ and the change of variables $\gamma = R\beta$; then $X\beta = Q\gamma$, and one can run the LASSO for $\gamma$ with the orthonormal design $Q$, though note the penalty becomes $\|R^{-1}\gamma\|_1$ rather than $\|\gamma\|_1$). Then each component of the LASSO estimator has the closed form $$\widehat{\beta}_i = \operatorname{sign}(\beta^{OLS}_i)\cdot\left(|\beta^{OLS}_i| - \tfrac{\lambda}{2}\right)^+,$$ where $\beta^{OLS} = X^\top y$ is the OLS estimator and the soft threshold $\lambda/2$ matches the objective $\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ as written above.
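
Here is a quick numerical sanity check of this closed form (a sketch with simulated data; `alpha = lam / (2 * n)` translates the question's $\lambda$ into sklearn's parameterisation, which scales the squared error by $\frac{1}{2n}$):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 5

# Design with orthonormal columns (X^T X = I), via QR of a random matrix.
X, _ = np.linalg.qr(rng.normal(size=(n, p)))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + 0.1 * rng.normal(size=n)

beta_ols = X.T @ y  # OLS estimate under an orthonormal design
lam = 1.0           # lambda in the question's objective

# Closed-form LASSO solution: soft-thresholding at lambda / 2.
beta_closed = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2, 0.0)

# Compare with a generic solver on the same objective.
fit = Lasso(alpha=lam / (2 * n), fit_intercept=False).fit(X, y)
print(beta_closed)
print(fit.coef_)  # should agree up to solver tolerance
```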

By the KKT condition (complementary slackness, with the constraint active), we have the relationship $$t = \sum_{i=1}^p \left(|\beta^{OLS}_i| - \tfrac{\lambda}{2}\right)^+.$$

Now the relationship is easy to see: as $\lambda$ increases, more components of the sum are set to zero and the surviving ones shrink, so $t$ must decrease, and vice versa. The exact relationship depends on the OLS estimator $\beta^{OLS} = X^\top y$. However, as long as $X$ and $y$ are given, one can use the above formula to solve exactly for the relationship between $\lambda$ and $t$.
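
To make the $\lambda \leftrightarrow t$ map concrete, here is a small sketch (with hypothetical $\beta^{OLS}$ values) that evaluates $t(\lambda)$ and inverts it by bisection, exploiting the fact that $t(\lambda)$ is continuous and nonincreasing:

```python
import numpy as np

# Hypothetical OLS coefficients under an orthonormal design.
beta_ols = np.array([3.0, -2.0, 0.1, 1.5, -0.4])

def t_of_lambda(lam):
    """Implied constraint t = sum_i (|beta_ols_i| - lam/2)^+."""
    return np.maximum(np.abs(beta_ols) - lam / 2, 0.0).sum()

def lambda_of_t(t_target, iters=100):
    """Invert t(lambda) by bisection; t is continuous and nonincreasing."""
    lo, hi = 0.0, 2 * np.abs(beta_ols).max()  # t(hi) = 0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if t_of_lambda(mid) > t_target:
            lo = mid  # constraint still too loose: increase lambda
        else:
            hi = mid
    return (lo + hi) / 2

lam = lambda_of_t(2.0)
print(lam, t_of_lambda(lam))  # t(lam) is approximately 2.0
```

Since $t(\lambda)$ is piecewise linear, one could also invert it exactly by sorting the $|\beta^{OLS}_i|$; bisection is simply the shortest thing to write down.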