Suppose I'm performing linear regression. My lecturer said the formula below can be used for estimating the weight vector that is passed to the L2-norm part of the loss function, but he didn't elaborate. I have 2 questions. When is it a good idea to do so, and why? And if I do the gradient descent manually, do I have to update these weights in addition to the "normal" ones?
$$w = (X^TX + λI_p)^{-1}X^Ty$$
X - design matrix;
y - vector of target values from the training data (x, y);
$I_p$ - the p×p identity matrix, where p is the dimension of the weight vector;
λ - regularization coefficient
This is called ridge regression. The original motivation for it was to deal with high (even perfect) collinearity among the explanatory variables: in that case $X^TX$ is singular or nearly singular, and adding $\lambda I_p$ makes it invertible. The larger the $\lambda$, the more stable the model, but the less information from the design matrix $X^TX$ is used. Another effect of this so-called $\ell_2$ regularization is shrinkage of the weights $w$: the larger the $\lambda$, the smaller the weights. This introduces bias but reduces variance (the bias-variance trade-off). Regarding the second question: there are no extra weights to update. The penalty $\lambda\|w\|^2$ is a function of the same weight vector, so it merely adds a term $2\lambda w$ to the gradient of the squared-error loss. In any case, a closed-form solution exists, $w = (X^TX + \lambda I_p)^{-1}X^Ty$, so gradient descent is not strictly needed.
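To make this concrete, here is a minimal NumPy sketch (the toy data, learning rate, and iteration count are made up for illustration) showing that gradient descent on the ridge loss, with the extra $2\lambda w$ term in the gradient, converges to the same weights as the closed-form formula:

```python
import numpy as np

# Hypothetical toy data: 50 samples, 3 features (values are illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)

lam = 0.1  # regularization coefficient lambda

# Closed-form ridge solution: w = (X^T X + lambda I)^-1 X^T y
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Gradient descent on the ridge loss L(w) = ||y - Xw||^2 + lambda ||w||^2.
# The SAME weight vector is updated; the penalty only adds 2*lambda*w
# to the gradient of the squared-error term.
w = np.zeros(X.shape[1])
lr = 1e-3  # step size (assumed; must be small enough for convergence)
for _ in range(20000):
    grad = 2 * X.T @ (X @ w - y) + 2 * lam * w
    w -= lr * grad

print("closed-form:      ", w_closed)
print("gradient descent: ", w)
```

Both routes give the same minimizer; the closed form is just the point where that gradient is zero, solved directly.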