Combining multiple regression formulas


The normal equation for ordinary least squares regression is as follows:

$$ \hat{w} = (X^TX)^{-1}X^Ty $$

but this gives a single straight-line fit, which underfits when the underlying relationship is nonlinear. One way to counter underfitting is to use locally weighted linear regression with a Gaussian kernel. The formula goes as follows:

$$ \hat{w} = (X^TWX)^{-1}X^TWy $$

where $W$ is a diagonal matrix of weights: the closer an unknown point $x$ is to training example $i$, the higher the value of $W[i,i]$. So instead of a single straight line, the fit can bend, depending on the size of the kernel.
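The weighted formula above can be sketched in a few lines of numpy. This is a minimal illustration, not a production implementation; the function name, the toy data, and the bandwidth parameter `tau` are all illustrative choices, not anything from the question.

```python
import numpy as np

def lwlr_predict(x_query, X, y, tau=0.5):
    """Predict y at x_query with locally weighted linear regression.

    Weights each training row i by a Gaussian kernel of its distance
    to x_query, then solves the weighted normal equation
    w_hat = (X^T W X)^{-1} X^T W y.  `tau` is the kernel bandwidth:
    smaller tau -> more local, curvier fit.
    """
    d2 = np.sum((X - x_query) ** 2, axis=1)
    W = np.diag(np.exp(-d2 / (2.0 * tau ** 2)))
    w_hat = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ w_hat

# Toy data: y = x^2 on [0, 1], with an intercept column in X.
xs = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(xs), xs])
y = xs ** 2

# A globally linear model cannot fit x^2, but the local fit at
# x = 0.5 lands near the true value 0.25.
pred = lwlr_predict(np.array([1.0, 0.5]), X, y, tau=0.1)
```

Note that the weighted normal equation is solved afresh for every query point, which is why LWLR is considerably more expensive than fitting OLS once.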

And finally, there is ridge regression, which shrinks the regression weights to avoid overfitting. The formula goes as:

$$ \hat{w} = (X^TX + \lambda I)^{-1}X^Ty $$
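As a quick sanity check of the shrinkage behaviour, the closed form above can be evaluated directly with numpy (the data here is synthetic, purely for illustration; with $\lambda = 0$ it reduces to the OLS normal equation):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=100)

def ridge(X, y, lam):
    # w_hat = (X^T X + lambda I)^{-1} X^T y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

w_ols = ridge(X, y, 0.0)    # lambda = 0 recovers plain OLS
w_reg = ridge(X, y, 10.0)   # positive lambda shrinks the weights
```

Increasing $\lambda$ pulls the coefficient vector toward zero, trading a little bias for lower variance.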

My question is: can equations 2 and 3 be combined to get the best of both worlds, i.e. a fit that isn't a straight line but also doesn't overfit the data?

$$ \hat{w} = (X^TWX + \lambda I)^{-1}X^TWy $$

Best Answer

The basic model behind the first equation is

$$ y =X\beta + \epsilon $$

with $\epsilon$ drawn from a multivariate normal $\mathcal N(0,\sigma^2 I)$. Assuming this model holds, the best linear unbiased estimator (BLUE) of $\beta$ is the OLS estimator:

$$ \hat\beta = (X^tX)^{-1}X^ty $$

Now, if you believe that your error is not homoscedastic, so that for some reason $\epsilon \sim \mathcal N(0,\Sigma)$ for some covariance matrix $\Sigma$, then you can take the Cholesky decomposition $\Sigma=LL^t$ and write:

$$ y =X\beta + L \epsilon' $$

with $\epsilon'\sim \mathcal N(0,I)$. With $\Sigma$ positive definite, $L$ is invertible, and you can write

$$ L^{-1}y = L^{-1}X\beta + \epsilon' $$

Then you're back in the first scenario, with the BLUE being $\hat\beta = (X^t\Omega X)^{-1}X^t \Omega y$ with $\Omega=\Sigma^{-1}=(LL^t)^{-1}=L^{-t}L^{-1}$.
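This equivalence is easy to verify numerically: running plain OLS on the transformed data $L^{-1}X$, $L^{-1}y$ gives exactly the GLS formula. A small check on synthetic heteroscedastic data (the data and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 2
X = rng.normal(size=(n, p))

# Heteroscedastic noise with a known diagonal covariance Sigma.
sigmas = rng.uniform(0.1, 2.0, size=n)
Sigma = np.diag(sigmas ** 2)
y = X @ np.array([1.0, -2.0]) + sigmas * rng.normal(size=n)

# GLS closed form with Omega = Sigma^{-1}:
Omega = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(X.T @ Omega @ X, X.T @ Omega @ y)

# Same estimate via the Cholesky transform: Sigma = L L^t,
# then regress L^{-1} y on L^{-1} X with plain OLS.
L = np.linalg.cholesky(Sigma)
Xt = np.linalg.solve(L, X)
yt = np.linalg.solve(L, y)
beta_ols_transformed, *_ = np.linalg.lstsq(Xt, yt, rcond=None)
```

The two estimates agree up to floating-point error, confirming that whitening by $L^{-1}$ reduces the heteroscedastic problem to ordinary least squares.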

If you can approximate the inverse of the covariance matrix (the $\Omega$), for example with a diagonal matrix, you are in the second scenario you're referring to.

Now, it's not hard to link this with ridge regression. With homoscedastic errors, you regularize with

$$ \min_\beta \|X\beta - y\|_2^2 + \lambda \|\beta\|_2^2 $$

and the answer is, as you wrote: $\hat\beta_R = (X^t X+\lambda I)^{-1}X^t y$.

Let's do exactly the same thing after transforming the data with the square root $L^{-1}$ of $\Omega=L^{-t}L^{-1}$ (or an approximation):

$$ \min_\beta \| L^{-1}X\beta - L^{-1} y\|_2^2 + \lambda\|\beta\|_2^2 $$

just rewrite $X'=L^{-1}X$ and $y'=L^{-1}y$; the problem is then the same as in the ridge regression case and therefore, indeed, the solution is the "combination":

$$\hat\beta_R' = (X^t\Omega X+\lambda I)^{-1}X^t \Omega y. $$
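The final identity can also be checked numerically: for a diagonal $\Omega = \operatorname{diag}(w)$, whitening amounts to scaling each row by $\sqrt{w_i}$, and ridge on the scaled data matches the combined closed form. A sketch with arbitrary synthetic data and weights:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 40, 3, 0.7
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
w = rng.uniform(0.1, 1.0, size=n)   # diagonal weights, e.g. from a Gaussian kernel
Omega = np.diag(w)

# Combined closed form: (X^T Omega X + lambda I)^{-1} X^T Omega y
beta_combined = np.linalg.solve(X.T @ Omega @ X + lam * np.eye(p),
                                X.T @ Omega @ y)

# Ridge on the whitened data X' = L^{-1} X, y' = L^{-1} y, where
# Omega = (L L^t)^{-1}; for diagonal Omega, L^{-1} = diag(sqrt(w)).
Xp = np.sqrt(w)[:, None] * X
yp = np.sqrt(w) * y
beta_ridge = np.linalg.solve(Xp.T @ Xp + lam * np.eye(p), Xp.T @ yp)
```

Since $X'^tX' = X^t\Omega X$ and $X'^ty' = X^t\Omega y$, the two computations are algebraically identical, so the "combined" estimator really is just ridge regression after whitening.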