I am studying applications of ridge regression in machine learning, and while reading the Hoerl and Kennard paper on ridge regression I came across an ambiguity I don't understand. They refer to design matrices for which $X'X$ is "not nearly" a unit matrix, and which are therefore bad candidates for ordinary least squares but good candidates for ridge regression.
The solution to the ordinary least squares regression is:
$\beta = (X'X)^{-1}X'y$
While the solution to ridge regression is:
$\beta = (X'X+\lambda I)^{-1}X'y$
As $X'X$ moves further from a unit matrix, the OLS solution becomes less reliable. When they use the term "unit matrix", are they referring to...
1) The identity matrix
2) An (n x n) matrix of all ones
3) Or an invertible matrix, i.e., the $(n \times n)$ matrix $A=X'X$ for which some $(n \times n)$ matrix $B$ exists such that $AB=BA=I$
I think the answer is 3 but I can't determine this for sure.
Here's a link to the paper: https://pdfs.semanticscholar.org/910e/d31ef5532dcbcf0bd01a980b1f79b9086fca.pdf
In my opinion, the "unit matrix" refers to the identity matrix $I_{p\times p}$, where $p$ is the number of variables.
If $X'X$ is far from $I$, or more specifically, if the condition number of $X'X$ (the ratio of its largest to smallest eigenvalue) is very large, then the OLS solution is numerically unstable, and some form of regularization (e.g., ridge regression) is typically needed.
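A quick numerical sketch of this point, using made-up data with two nearly collinear predictors (the data and $\lambda$ value here are my own illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly collinear predictors, so X'X is far from the identity
# and its condition number is enormous.
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)   # almost an exact copy of x1
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.normal(size=n)

XtX = X.T @ X
print("cond(X'X) =", np.linalg.cond(XtX))  # huge condition number

# OLS: beta = (X'X)^{-1} X'y -- coefficients blow up and offset each other
beta_ols = np.linalg.solve(XtX, X.T @ y)

# Ridge: beta = (X'X + lambda*I)^{-1} X'y -- stable, shrunken coefficients
lam = 1.0
beta_ridge = np.linalg.solve(XtX + lam * np.eye(2), X.T @ y)

print("OLS:  ", beta_ols)
print("Ridge:", beta_ridge)  # both coefficients near 0.5, summing to ~1
```

Adding $\lambda I$ lifts every eigenvalue of $X'X$ by $\lambda$, which caps the condition number of the matrix being inverted; that is exactly why the ridge solution is stable here while the OLS coefficients are wild.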