Ridge regression to minimize RMSE instead of MSE


Given a matrix $X$ and a vector $\vec{y}$, ordinary least squares (OLS) regression finds the $\vec{c}$ that minimizes $\left\| X \vec{c} - \vec{y} \right\|_2^2$, where $\left\| \vec{v}\right\|_2^2=\vec{v} \cdot \vec{v}$.

Ridge regression tries to find $\vec{c}$ such that $\left\| X \vec{c} - \vec{y} \right\|_2^2 + \left\| \Gamma \vec{c} \right\|_2^2 $ is minimal.

However, in my application I need to minimize not the sum of squared errors but the square root of that sum. Since the square root is an increasing function, the minimum is attained at the same point, so OLS regression will still give the same result. But will ridge regression?

On the one hand, I don't see how minimizing $\left\| X \vec{c} - \vec{y} \right\|_2^2 + \left\| \Gamma \vec{c} \right\|_2^2 $ will necessarily result in the same $\vec{c}$ as minimizing $\sqrt{ \left\| X \vec{c} - \vec{y} \right\|_2^2 } + \left\| \Gamma \vec{c} \right\|_2^2 $.

On the other hand, I've read (though never seen proved) that minimizing $\left\| X \vec{c} - \vec{y} \right\|_2^2 + \left\| \Gamma \vec{c} \right\|_2^2 $ (ridge regression) is equivalent to minimizing $\left\| X \vec{c} - \vec{y} \right\|_2^2$ subject to the constraint $ \left\|\Gamma \vec{c}\right\|_2^2 \le t$, where $t$ is some parameter. And if that is the case, it should give the same solution as minimizing $\sqrt{ \left\| X \vec{c} - \vec{y} \right\|_2^2}$ under the same constraint.
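As a numerical sanity check of this claimed equivalence, here is a sketch with made-up data, taking $\Gamma = \sqrt{\lambda}\, I$ so that $\|\Gamma \vec{c}\|_2^2 = \lambda \|\vec{c}\|_2^2$, and using scipy for the constrained minimization:

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

# Ridge solution for Gamma = sqrt(lam) * I, via the usual closed form.
lam = 2.0
c_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
t = c_ridge @ c_ridge  # constraint level implied by this ridge solution

# Minimize the square ROOT of the SSE subject to ||c||_2^2 <= t.
rmse = lambda c: np.sqrt(np.sum((X @ c - y) ** 2))
con = NonlinearConstraint(lambda c: c @ c, -np.inf, t)
res = minimize(rmse, np.zeros(3), method="SLSQP", constraints=[con],
               options={"ftol": 1e-12, "maxiter": 500})

# The constrained RMSE minimizer coincides with the ridge solution.
print(np.allclose(res.x, c_ridge, atol=1e-3))
```

The constraint is active at the optimum, so the constrained squared-error minimizer is exactly the ridge solution, and taking the square root of the objective does not move it.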

2 Answers
From the answer on stats.stackexchange:

Minimizing the regularized MSE is equivalent to minimizing the (unregularized) MSE under some constraint. Minimizing the regularized RMSE is equivalent to minimizing the RMSE under a different constraint. Hence, the two solutions will not necessarily be identical.
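A minimal illustration of that point (made-up data; scipy is used for the numerical minimization): with the same fixed $\lambda$, the penalized-MSE and penalized-RMSE minimizers differ.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)
lam = 1.0

sse = lambda c: np.sum((X @ c - y) ** 2)  # sum of squared errors
pen = lambda c: lam * (c @ c)             # ridge penalty, Gamma = sqrt(lam) * I

c_mse = minimize(lambda c: sse(c) + pen(c), np.zeros(3)).x
c_rmse = minimize(lambda c: np.sqrt(sse(c)) + pen(c), np.zeros(3)).x

# Same lam, two different minimizers: the implied constraints differ.
print(np.allclose(c_mse, c_rmse, atol=1e-3))
```

The RMSE-penalized minimizer does lie on the same ridge path, but at a different value of $\lambda$, which is exactly what the second answer below works out.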


It is worthwhile to clarify this answer in the context of regression. For any practical purpose the solution is indeed the same. Minimizing either $\| Y - Xc\|_2^2$ or $\| Y - Xc\|_2$ means looking for a vector $c$ that minimizes the chosen loss function subject to some constraint $ \|\Gamma c \|_2^2 \le t$. In Lagrangian form the problem reads $$ \arg\min_c \mathcal{L}(\lambda, c) =\arg\min_c \left( \| Y - Xc\|_2 ^ 2+ \lambda(\| \Gamma c\|_2^2 - t) \right). $$
Note that $\lambda$ is a "dummy" parameter, the so-called "regularization parameter". You search (numerically) for the $\lambda$ that minimizes your chosen loss function (MSE or RMSE), and you do not care about its value as such. Consequently, the optimal $c$ will be the same, while $\lambda$ will change with the specification of the loss function. Any other combination is irrelevant or incorrect: fixing $\lambda$ at a constant and then searching for the optimal $c$ is statistically meaningless, and minimizing the MSE while choosing $\lambda$ according to an RMSE loss is simply incorrect. So there is no reason to discuss those cases.

Formally, let us take $\Gamma = I$ for the sake of convenience. Minimizing the penalized MSE, the gradient (F.O.C.) is $$ -2X'(Y - Xc) + 2\lambda c = 0, \quad\text{i.e.}\quad (X'X + \lambda I)c = X'Y, $$ so the ridge estimator of $c$ is $$ \hat{c}(\lambda) = (X'X + \lambda I)^{-1} X'Y. $$ If instead you are minimizing the penalized RMSE, the gradient is

$$ -\frac{X'(Y - Xc)}{ \|Y-Xc\|} + 2\lambda c = 0, $$ or $$ X'(Y - Xc) = 2\lambda \|Y-Xc\|\, c, \quad\text{i.e.}\quad \left(X'X + 2 \lambda \|Y - Xc\|\, I\right) c = X'Y, $$ which is non-linear in $c$ and hence has no closed-form solution. However, since the square root is a monotone, one-to-one transformation, the family of solutions traced out as $\lambda$ varies is the same as for $\| \cdot \|_2^2$; only the value of $\lambda$ changes according to the transformation you applied.
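Both first-order conditions can be checked numerically. The sketch below (random made-up data, arbitrary $\lambda$) verifies the closed-form ridge estimator against direct minimization, and then checks that the RMSE-penalized minimizer is the ridge solution at the effective penalty $2\lambda\|Y - Xc\|$ implied by the second condition:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 3))
Y = rng.normal(size=60)
lam = 0.1

# MSE case: the closed form matches direct minimization of the objective.
c_hat = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ Y)
mse_obj = lambda c: np.sum((Y - X @ c) ** 2) + lam * (c @ c)
c_mse = minimize(mse_obj, np.zeros(3)).x

# RMSE case: no closed form, so minimize numerically ...
rmse_obj = lambda c: np.linalg.norm(Y - X @ c) + lam * (c @ c)
c_rmse = minimize(rmse_obj, np.zeros(3)).x

# ... and check it is the ridge solution at the effective penalty
# lam_eff = 2 * lam * ||Y - Xc||, evaluated at the optimum.
lam_eff = 2 * lam * np.linalg.norm(Y - X @ c_rmse)
c_ridge = np.linalg.solve(X.T @ X + lam_eff * np.eye(3), X.T @ Y)

print(np.allclose(c_hat, c_mse, atol=1e-4),
      np.allclose(c_rmse, c_ridge, atol=1e-3))
```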