Proof regarding Ridge and Lasso regularization


I have a problem understanding this exercise. I would be very happy to receive a little help here. Thanks!

Best answer:

You have some data $\mathcal{D} = \{ (x_i, y_i) \}_{i=1,\ldots, n}$, but instead of fitting a model $f(x) = \beta x$, we duplicate the predictor variable. You can imagine it as taking the original dataset, e.g. $$\mathcal{D} = \{ (1,2), (4,8), (7, 14) \}$$ and duplicating each $x_i$ to get $$\mathcal{D}' = \{ (1,1,2) , (4,4,8), (7,7,14) \}$$

A linear model for $\mathcal{D}'$ would look like $f(x_1, x_2) = \beta_1 x_1 + \beta_2 x_2.$ Since we know $x_1 = x_2 = x,$ the linear model is more simply written as $f(x) = \beta_1 x + \beta_2 x.$ The RSS for the linear model is $ \sum_i | y_i - f(x_i) |^2 = \sum_i | y_i - (\beta_1+\beta_2) x_i |^2.$ The ridge regression penalty on such a model is $\lambda(\beta_1^2 + \beta_2^2)$ and the lasso penalty is $\lambda(|\beta_1| + |\beta_2|).$

a)

The loss in the Ridge regression model is $$L(\beta_1, \beta_2) = \sum_i | y_i - (\beta_1 + \beta_2) x_i|^2 + \lambda (\beta_1^2 + \beta_2^2)$$

Now suppose that $\hat{\beta_1}, \hat{\beta_2}$ minimize the loss. The RSS depends only on the sum $\beta_1 + \beta_2,$ which is unchanged when both coefficients are replaced by their average, while for the penalty term $$2\left( \frac{ \hat{\beta_1} + \hat{\beta_2} }{2} \right)^2 = \frac{ (\hat{\beta_1} + \hat{\beta_2})^2 }{2} \leq \hat{\beta_1}^2 + \hat{\beta_2}^2,$$ using the fact that $0 \leq (x-y)^2$ with equality if and only if $x=y.$ Hence $$L\left( \frac{ \hat{\beta_1} + \hat{\beta_2} }{2}, \frac{ \hat{\beta_1} + \hat{\beta_2}}{2} \right) \leq L(\hat{\beta_1}, \hat{\beta_2})$$ with equality if and only if $\hat{\beta_1} = \hat{\beta_2}.$ Since by assumption $L(\hat{\beta_1}, \hat{\beta_2})$ is minimal, we must have equality. So we see that the optimal solution always has $\hat{\beta_1} = \hat{\beta_2}.$
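As a quick numeric sanity check (not part of the original exercise, and the penalty strength `lam` is an arbitrary choice), we can fit the ridge model on the toy dataset $\mathcal{D}'$ via the closed form $\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y$ and confirm the two coefficients come out equal:

```python
# Hypothetical numeric check: fit ridge on D' via the closed form
#   beta_hat = (X^T X + lam * I)^{-1} X^T y
# and confirm that the two coefficients are equal.

lam = 1.0                                      # assumed penalty strength
data = [(1.0, 2.0), (4.0, 8.0), (7.0, 14.0)]   # the dataset D from the text

# With the predictor duplicated, X^T X = [[s, s], [s, s]] where
# s = sum(x_i^2), and X^T y = [t, t] where t = sum(x_i * y_i).
s = sum(x * x for x, _ in data)
t = sum(x * y for x, y in data)

# Solve the 2x2 normal equations (X^T X + lam * I) beta = X^T y directly.
a11, a12 = s + lam, s
a21, a22 = s, s + lam
det = a11 * a22 - a12 * a21
beta1 = (a22 * t - a12 * t) / det
beta2 = (a11 * t - a21 * t) / det

print(beta1, beta2)  # identical: ridge splits the weight evenly
```

Algebraically, both coefficients reduce to $\lambda t / \det = t/(2s+\lambda),$ matching the result that ridge forces $\hat{\beta_1} = \hat{\beta_2}.$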

b)

The loss in the Lasso regression model is $$L(\beta_1, \beta_2) = \sum_i | y_i - (\beta_1 + \beta_2) x_i|^2 + \lambda (|\beta_1| + |\beta_2|)$$ and you can see for yourself why, for a given $\beta,$ all pairs $\beta_1, \beta_2$ such that $\beta_1 + \beta_2 = \beta$ and having the same sign yield the same loss: the RSS depends only on the sum, and when the coefficients share a sign the penalty is $\lambda(|\beta_1| + |\beta_2|) = \lambda|\beta|.$ So there are an infinite number of pairs $(\hat{\beta_1}, \hat{\beta_2})$ which optimize the loss function. A concrete example of this statement is that the linear model $f(x_1, x_2) = 2x_1 + 3x_2$ has the same RSS and same Lasso penalty as $f(x_1, x_2) = 3x_1 + 2x_2,$ because in this problem $x_1 = x_2 = x.$
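The non-uniqueness is easy to check numerically. In this hypothetical sketch (the value of `lam` and the split $\beta_1 + \beta_2 = 5$ from the concrete example above are arbitrary choices), every same-sign split of the sum gives exactly the same lasso loss:

```python
# Hypothetical check: with x1 = x2 = x, any split of beta = beta1 + beta2
# into parts of the same sign gives exactly the same lasso loss, so the
# minimizer is not unique.

lam = 0.5                                      # assumed penalty strength
data = [(1.0, 2.0), (4.0, 8.0), (7.0, 14.0)]   # the dataset D from the text

def lasso_loss(b1, b2):
    """RSS plus the lasso penalty lam * (|b1| + |b2|)."""
    rss = sum((y - (b1 + b2) * x) ** 2 for x, y in data)
    return rss + lam * (abs(b1) + abs(b2))

# Each pair below has beta1 + beta2 = 5 with both parts nonnegative,
# so the RSS and the penalty lam * 5 are the same in every case.
print(lasso_loss(2.0, 3.0))
print(lasso_loss(3.0, 2.0))
print(lasso_loss(0.0, 5.0))
```

All three calls print the same value, confirming that the loss surface is flat along the line $\beta_1 + \beta_2 = \beta$ within each sign quadrant.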

There is a major lesson to take from this exercise. It is quite common to see someone perform a Lasso regression and interpret the optimal parameters of the model as measures of how important each feature is in predicting the target. As we see from this example, if linear relationships exist between the features, then the parameters cannot be interpreted that way.