The formula for the ridge regression coefficients is
$$ \beta = ({X^{\top}X+\lambda I})^{-1}{X^{\top}Y}$$
I have tried to derive it as follows:
- The loss is (I am omitting the sum before the square term):
$$ L = {(X\beta-Y)^2+\lambda\beta^{\top}\beta} $$
- Taking the derivative with respect to $\beta$:
$$ \frac{dL}{d\beta} = 2X^{\top}({X\beta-Y})+2{\lambda \beta}$$
- Setting the derivative to zero (the factors of $2$ cancel) and expanding:
$$ X^{\top} X\beta- X^{\top} Y +\lambda \beta = 0 $$
- Isolating $\beta$:
$$ ( X^{\top} X + \lambda)\beta = X^{\top} Y$$
Here in the 4th step, I should somehow arrive at a formula that has $\mathbf I$ right after $\lambda$. Where am I making a mistake? I know the $I$ must be there, but I don't understand why, or based on what rule I should add it after $\lambda$. Thanks.
There's no rule that says that $X^T X \beta + \lambda \beta = (X^T X + \lambda) \beta$. The expression on the right would not make sense, because we would be adding a matrix $X^T X$ to a scalar $\lambda$.
However, we can say that $X^T X \beta + \lambda \beta = X^T X \beta + \lambda I \beta = (X^T X + \lambda I) \beta$, because $\lambda \beta = \lambda I \beta$.
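A quick NumPy check (my own illustration, not part of the original argument) makes the distinction concrete: $A\beta + \lambda\beta$ equals $(A + \lambda I)\beta$, whereas adding the bare scalar $\lambda$ to the matrix $A$ is something else entirely (NumPy broadcasting adds $\lambda$ to *every* entry, not just the diagonal).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = rng.standard_normal((n, n))   # stand-in for X^T X
b = rng.standard_normal(n)        # stand-in for beta
lam = 0.5

left = A @ b + lam * b                 # A beta + lambda beta
right = (A + lam * np.eye(n)) @ b      # (A + lambda I) beta
assert np.allclose(left, right)        # the factored form agrees

# Adding the scalar directly gives a different matrix entirely:
assert not np.allclose(A + lam, A + lam * np.eye(n))
```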
By the way, I would write the full calculation like this. Our goal is to minimize the function $$ f(\beta) = \frac12 \| X \beta - Y \|^2 + \frac{\lambda}{2} \| \beta \|^2. $$ Using the multivariable chain rule, we see that the derivative of $f$ is $$ f'(\beta) = (X \beta - Y)^T X + \lambda \beta^T. $$ So the gradient of $f$ is $$ \nabla f(\beta) = f'(\beta)^T = X^T(X \beta - Y) + \lambda \beta. $$ Setting $\nabla f(\beta) = 0$, we obtain $$ (X^T X + \lambda I) \beta = X^T Y. $$
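The final identity is also easy to verify numerically. The sketch below (my own, assuming NumPy; the shapes and data are made up) solves $(X^T X + \lambda I)\beta = X^T Y$ and checks that the gradient $X^T(X\beta - Y) + \lambda\beta$ vanishes at the solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_features = 50, 4
X = rng.standard_normal((n_samples, n_features))
Y = rng.standard_normal(n_samples)
lam = 2.0

# Solve the ridge normal equations (X^T X + lambda I) beta = X^T Y.
beta = np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ Y)

# The gradient of f at the solution should be (numerically) zero.
grad = X.T @ (X @ beta - Y) + lam * beta
assert np.allclose(grad, 0.0)
```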