I'm reading Deep Learning (Ian Goodfellow and Yoshua Bengio) and I'm stuck on this section. The authors try to show how $L^2$ norm regularization affects a simple linear model.
Minimizing the sum of squared errors with an added $L^2$ regularization term leads to:
$$w = (X^TX + \alpha I)^{-1}X^Ty$$
The matrix $X^TX$ is proportional to the variance-covariance matrix of the data ($\frac{1}{m}X^TX$). The authors continue: "The diagonal entries of this matrix $(X^TX)$ correspond to the variance of each input feature. We can see that $L^2$ regularization causes the learning algorithm to perceive the input $X$ as having higher variance, which makes it shrink the weights on features whose covariance with the output is low compared to this added variance".
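To make the setup concrete, here is a small numpy sketch (my own toy example, not from the book) that computes the closed-form solution above with and without the $\alpha I$ term. One feature drives the output, the other is pure noise, so its covariance with $y$ is small; the feature names and the value of `alpha` are just choices I made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000
x1 = rng.normal(size=m)  # feature strongly related to the output
x2 = rng.normal(size=m)  # feature with low covariance with the output
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.normal(scale=0.1, size=m)  # y depends on x1 only

alpha = 50.0  # regularization strength, chosen arbitrarily to make the effect visible
I = np.eye(X.shape[1])

# w = (X^T X)^{-1} X^T y              (ordinary least squares)
# w = (X^T X + alpha*I)^{-1} X^T y    (the regularized solution from the book)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
w_ridge = np.linalg.solve(X.T @ X + alpha * I, X.T @ y)

print("OLS weights:  ", w_ols)
print("ridge weights:", w_ridge)
```

Running this, the regularized weights are uniformly pulled toward zero, and the weight on `x2` (whose covariance with the output is small compared to the added $\alpha$) ends up negligible next to the weight on `x1`.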
I understand why $L^2$ regularization increases the perceived variance by $\alpha$, but why should this reduce the weights whose covariance with the output is low compared to this added variance? Sorry for my bad English; I hope the question is clear. Both qualitative and quantitative explanations are welcome.