Where does the identity matrix come from in the formula for ridge regression coefficients?


The formula for the ridge regression coefficients is

$$ \beta = ({X^{\top}X+\lambda I})^{-1}{X^{\top}Y}$$

I have tried to derive it as follows:

  1. The loss in matrix form is (the sum over samples is absorbed into the matrix notation):

$$ L = {(X\beta-Y)^{\top}(X\beta-Y)+\lambda\beta^{\top}\beta} $$

  2. Take the derivative with respect to $\beta$:

$$ \frac{dL}{d\beta} = 2{X^{\top}(X\beta-Y)}+2{\lambda \beta}$$

  3. Equating the derivative to zero (we can drop the factor of $2$) and expanding:

$$ X^{\top} X\beta- X^{\top} Y +\lambda \beta = 0 $$

  4. Isolating $\beta$:

$$ ( X^{\top} X + \lambda)\beta = X^{\top} Y$$

Here in the 4th step, I should somehow arrive at a formula that has $\mathbf I$ right after $\lambda$. Where am I making a mistake? I know the $\mathbf I$ must be there, but I don't understand why, or by what rule I should add it after $\lambda$. Thanks.

3 Answers

BEST ANSWER

There's no rule that says that $X^T X \beta + \lambda \beta = (X^T X + \lambda) \beta$. The expression on the right would not make sense, because we would be adding a matrix $X^T X$ to a scalar $\lambda$.

However, we can say that $X^T X \beta + \lambda \beta = X^T X \beta + \lambda I \beta = (X^T X + \lambda I) \beta$.


By the way, I would write the full calculation like this. Our goal is to minimize the function $$ f(\beta) = \frac12 \| X \beta - Y \|^2 + \frac{\lambda}{2} \| \beta \|^2. $$ Using the multivariable chain rule, we see that the derivative of $f$ is $$ f'(\beta) = (X \beta - Y)^ T X + \lambda \beta^T. $$ So the gradient of $f$ is $$ \nabla f(\beta) = f'(\beta)^T = X^T(X \beta - Y) + \lambda \beta. $$ Setting $\nabla f(\beta) = 0$, we obtain $$ (X^T X + \lambda I) \beta = X^T Y. $$
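As a sanity check, the closed-form solution can be verified numerically. This is a minimal NumPy sketch with made-up random data (not part of the original answer): it solves $(X^TX + \lambda I)\beta = X^TY$ and confirms that the gradient $\nabla f(\beta)$ vanishes at that $\beta$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 3, 0.5
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)

# Closed-form ridge solution: beta = (X^T X + lambda I)^{-1} X^T Y
beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Gradient of f(beta) = 1/2 ||X beta - Y||^2 + (lambda/2) ||beta||^2
grad = X.T @ (X @ beta - Y) + lam * beta

print(np.allclose(grad, 0))  # True: the solution zeros the gradient
```

Using `np.linalg.solve` rather than explicitly inverting $X^TX + \lambda I$ is the numerically preferred way to apply this formula.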

---

Here we are dealing with matrices, and $\lambda$ is a scalar, so we can't add $X^TX$ to $\lambda$.

Also, we can write $\beta = I\beta$. So, if we isolate $\beta$, we'll automatically have $I$ inside the parentheses.
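This factoring step can be checked numerically. A minimal NumPy sketch with arbitrary data (not part of the original answer): since $\lambda\beta = \lambda I\beta$, pulling $\beta$ out of the sum leaves $\lambda I$ inside the parentheses.

```python
import numpy as np

rng = np.random.default_rng(1)
p, lam = 4, 0.7
M = rng.standard_normal((p, p))   # stands in for X^T X
beta = rng.standard_normal(p)

# lambda * beta equals (lambda * I) @ beta, so factoring beta out
# of M @ beta + lam * beta yields (M + lam * I) @ beta
lhs = M @ beta + lam * beta
rhs = (M + lam * np.eye(p)) @ beta

print(np.allclose(lhs, rhs))  # True
```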

---

You correctly obtained $X^TX\beta+\lambda\beta=X^TY$, which has $i$th component $(X^TX)_{ij}\beta_j+\lambda\beta_i=(X^TY)_i$ (summation over the repeated index $j$ is implied). Obviously, we mustn't conflate $\beta_i$ with $\beta_j$. But we can write the left-hand side in terms of $\beta_j$ only, using the Kronecker delta $\delta_{ij}$: $$[(X^TX)_{ij}+\lambda\delta_{ij}]\beta_j=(X^TY)_i.$$The $\beta_j$ coefficient is $(X^TX+\lambda I)_{ij}$. If $X^TX+\lambda I$ is invertible, $$\beta=(X^TX+\lambda I)^{-1}X^TY.$$
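The componentwise equation can be verified directly with explicit index loops. A small NumPy sketch with random data (not part of the original answer), where `(i == j)` plays the role of the Kronecker delta $\delta_{ij}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 30, 4, 0.3
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
A = X.T @ X

# Solve (X^T X + lambda I) beta = X^T Y
beta = np.linalg.solve(A + lam * np.eye(p), X.T @ Y)

# Check the componentwise form [(X^T X)_{ij} + lambda delta_{ij}] beta_j = (X^T Y)_i,
# summing explicitly over the repeated index j
for i in range(p):
    lhs = sum((A[i, j] + lam * (i == j)) * beta[j] for j in range(p))
    assert np.isclose(lhs, (X.T @ Y)[i])

print("componentwise equation holds")
```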