Ridge Regression: Unit Matrix (Hoerl and Kennard 1970)


I am studying applications of Ridge Regression in Machine Learning, and while reading the Hoerl and Kennard paper on Ridge Regression I came across an ambiguity I don't understand. They refer to matrices that are "not nearly" unit matrices and thus are bad candidates for ordinary least squares but good candidates for ridge regression.

The solution to the ordinary least squares regression is:

$\beta = (X'X) ^ {-1}X'y$

While the solution to the Ridge Regression is:

$\beta = (X'X+\lambda I) ^ {-1}X'y$

As $X'X$ varies further from a unit matrix, the OLS solution is less reliable. In this case when they use the term "unit matrix" are they referring to...

1) The identity matrix

2) An (n x n) matrix of all ones

3) Or the (n x n) matrix $A=X'X$ where some (n x n) matrix $B$ exists such that $AB=BA=I$

I think the answer is 3 but I can't determine this for sure.

Here's a link to the paper: https://pdfs.semanticscholar.org/910e/d31ef5532dcbcf0bd01a980b1f79b9086fca.pdf

There are 2 best solutions below


In my opinion, the "unit matrix" refers to the identity matrix $I_{p\times p}$, where $p$ is the number of variables.

If $X' X$ is far away from $I$, or more specifically, if the condition number of $X' X$ is very large, then the OLS solution is unstable and some forms of smoothing (e.g., Ridge regression) are typically needed.
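A quick numpy sketch of this point (the data here is made up for illustration): when two columns of $X$ are nearly collinear, $X'X$ is far from the identity, its condition number blows up, and the OLS coefficients become unstable, while adding $\lambda I$ restores a well-conditioned system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly collinear predictors, so X'X is far from the identity
# and badly conditioned.
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)   # almost an exact copy of x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=n)

XtX = X.T @ X
print(np.linalg.cond(XtX))           # very large condition number

# OLS solution: solves an ill-conditioned system
beta_ols = np.linalg.solve(XtX, X.T @ y)

# Ridge solution: X'X + lambda*I is well conditioned, so the
# coefficients are shrunk toward a stable answer.
lam = 1.0
beta_ridge = np.linalg.solve(XtX + lam * np.eye(2), X.T @ y)
print(np.linalg.cond(XtX + lam * np.eye(2)))
print(beta_ols, beta_ridge)
```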


The ridge regression problem may be stated as
$$
\begin{equation}\tag{1}
\underset{\beta}{\text{minimize}}\quad \|y-X\beta\|_2^2+\lambda\|\beta\|_2^2 \triangleq f(\beta)
\end{equation}
$$
Thus, the variable $\beta$ is penalized for being too large in the 2-norm. Differentiating $f(\beta)$ with respect to $\beta$, setting the gradient equal to $0$, and dropping a common factor of $2$ yields
$$
-X^T(y-X\beta)+\lambda\beta=0 \iff (X^TX+\lambda I)\beta = X^Ty,
$$
which gives
$$
\hat{\beta} = (X^TX+\lambda I)^{-1}X^Ty.
$$
So we can conclude that the $I$ in the formula is the identity matrix (the diagonal matrix with ones on the diagonal and zeros elsewhere).

Just like Yining Wang says, ridge regression is used when the condition number of $X^TX$ is large, or when $X^TX$ (for some reason) is not full rank. It is also used when the 2-norm of the variable needs to be limited; as a matter of fact, solving $(1)$ is equivalent (for some matched choice of $\lambda$ and $\tau$) to solving
$$
\underset{\beta}{\text{minimize}}\quad \|y-X\beta\|_2^2 \qquad \text{subject to} \quad \|\beta\|_2^2\leq \tau.
$$
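The derivation can be checked numerically (a sketch with arbitrary random data, not tied to the paper): the closed-form $\hat{\beta}$ should make the gradient of $f$ vanish.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
lam = 2.0

# Closed-form ridge estimate from the stationarity condition
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Gradient of f(b) = ||y - Xb||^2 + lam*||b||^2 at beta_hat;
# it should be numerically zero.
grad = -2 * X.T @ (y - X @ beta_hat) + 2 * lam * beta_hat
print(np.max(np.abs(grad)))
```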