Necessity of standardizing data in regularized regression.


It is well known that in ridge or LASSO regression we add a regularization term to penalize large regression coefficients. What if the true relationship between the response and the covariates involves a large coefficient? Say the true relation is $y=2.5x_1+1.5x_2+200x_3$; the third term will be unfairly penalized in the regression because its coefficient is large but true. One way to avoid that is to standardize the data. But is that always necessary? Thanks in advance!


There are 2 best solutions below


I will take ridge regression as an example. In ridge regression, we minimize the objective function

$$E(\boldsymbol{w})=\sum_{n=1}^{N}\left[y_n-\boldsymbol{w}^T\boldsymbol{x}_n\right]^2+\lambda\boldsymbol{w}^T\boldsymbol{w}.$$

As you can see, the coefficients appear in both terms. If the regularization parameter is very large, the objective function is asymptotically

$$E_{\lambda \,\gg\, 1}(\boldsymbol{w}) \sim \lambda\boldsymbol{w}^T\boldsymbol{w},$$ so optimization drives $\|\boldsymbol{w}\|\to 0$. But this limit neglects the sum of squared errors, which is still part of the objective for any finite $\lambda$. If one of the coefficients matters for fitting the data, the squared-error term resists shrinking it, so ridge regression keeps that weight larger than the others. Only if you regularize too heavily will the regression kill all the weights.
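A quick numerical sketch of this (mine, not from the answer), using the closed-form ridge solution $\boldsymbol{w} = (X^TX + \lambda I)^{-1}X^Ty$ on the question's example coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
true_w = np.array([2.5, 1.5, 200.0])
y = X @ true_w + rng.normal(scale=0.1, size=n)

def ridge_weights(X, y, lam):
    """Closed-form ridge estimate: (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# All weights shrink toward zero as lambda grows, but the important
# third weight stays the largest in relative terms.
for lam in [0.0, 1.0, 100.0, 1e6]:
    print(lam, np.round(ridge_weights(X, y, lam), 3))
```

With $\lambda = 0$ the estimate recovers the true coefficients; as $\lambda \to \infty$ everything is killed, but the squared-error term keeps the third weight dominant at every intermediate $\lambda$.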

You can use cross-validation to ensure that your regularization is in a reasonable region. Remember that we never know the true relationship; we have to estimate the weights from empirical data. And if regularization hurts the validation metrics, we might even consider not using any regularization at all by setting $\lambda = 0$.
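As a minimal sketch of that procedure (my own, assuming plain k-fold cross-validation over a candidate grid that includes $\lambda = 0$):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 120
X = rng.normal(size=(n, 3))
y = X @ np.array([2.5, 1.5, 200.0]) + rng.normal(scale=1.0, size=n)

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate for penalty lam."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_mse(X, y, lam, k=5):
    """Mean validation MSE over k contiguous folds."""
    folds = np.array_split(np.arange(len(y)), k)
    errs = []
    for val in folds:
        train = np.setdiff1d(np.arange(len(y)), val)
        w = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[val] - X[val] @ w) ** 2))
    return np.mean(errs)

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
best = min(lambdas, key=lambda lam: cv_mse(X, y, lam))
print("best lambda:", best)
```

Because the data here genuinely contain a large coefficient, cross-validation selects a small $\lambda$ on its own; heavy regularization is rejected by the validation error.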


Yes. If the penalty is the Lasso constraint $\sum_{i=1}^p |\beta_i| \le k$, you first have to center the data: if you keep an intercept term $\beta_0$ inside the penalty, it gets penalized too, and since it can be very large it can distort the whole analysis.

The same goes for the coefficients of the $x$'s. A large $\beta$ can simply reflect the scales on which $x$ and $y$ are measured; it says nothing about the "strength" of the association between $x$ and $y$. Without scaling, you would therefore heavily penalize large $\beta$'s without justification. For example, the standard R packages that perform the Lasso and ridge regression automatically standardize the data before applying the algorithms. You cannot use either method without scaling the data first, because the optimal penalty $\lambda$ depends on the data and can be unnecessarily large just because your data are measured on some large or small scale.
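A small sketch of the scale artifact (mine, not from the answer): the third covariate below carries the same signal as the first, only recorded in units 80 times smaller, which is exactly what produces a "large but true" coefficient like the 200 in the question.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
z = rng.normal(size=(n, 3))      # three signals on a common scale
X = z.copy()
X[:, 2] /= 80.0                  # column 3 recorded in units 80x smaller
y = z @ np.array([2.5, 1.5, 2.5]) + rng.normal(scale=0.1, size=n)
# In the recorded units the model reads y = 2.5 x1 + 1.5 x2 + 200 x3.

def ridge(X, y, lam):
    """Closed-form ridge estimate for penalty lam."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

lam = 10.0
# On the raw data the penalty crushes the unit-driven "large" coefficient.
w_raw = ridge(X, y, lam)
# After standardizing, all three coefficients are comparable and are
# penalized on an equal footing.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
w_std = ridge(Xs, y - y.mean(), lam)
print(np.round(w_raw, 2), np.round(w_std, 2))
```

On the raw data the third coefficient is shrunk far below its true value of 200, while on the standardized data all three coefficients come out nearly equal, reflecting their equal association with $y$.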