Testing Ridge Regression coefficient with a known variance


Let $Y = X\beta + \epsilon$, where $\epsilon$ is Gaussian noise with mean $0$ and variance $\sigma^2$.

Let $\beta = (\beta_1, \dots, \beta_p)^T$. The ridge estimator of $\beta$ is given by $\hat{\beta} = (X^TX+\lambda I)^{-1}X^TY$.
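As a concrete sanity check, here is a minimal NumPy sketch of the ridge estimator (the dimensions, penalty, and coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 100, 5, 10.0           # sample size, predictors, penalty (illustrative)
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = X @ beta + rng.normal(size=n)  # Gaussian noise with sigma^2 = 1

# Ridge estimator: (X'X + lambda * I)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_hat)
```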

If $\sigma^2 = 1$, is the regular OLS $t$-test still applicable here to test whether $\beta_k = 0$? And to test whether $\beta = \textbf{0}$, is the $F$-test still applicable?

For the first part, I am guessing that the OLS test still works; in fact, since the variance is known, it would be a Z-test.
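For reference, the OLS Z-test with known $\sigma^2 = 1$ would look like this (a sketch with invented data and an arbitrary tested coordinate; since $\sigma$ is known, the statistic needs no Student correction):

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n, p = 60, 3
X = rng.normal(size=(n, p))
beta = np.array([0.8, 0.0, 0.0])
y = X @ beta + rng.normal(size=n)   # sigma^2 = 1, known

XtX_inv = np.linalg.inv(X.T @ X)
beta_ols = XtX_inv @ X.T @ y

# Z statistic for H0: beta_k = 0; with sigma = 1 known, Var(beta_ols_k)
# is exactly [ (X'X)^{-1} ]_kk, so the statistic is standard normal under H0.
k = 1
z = beta_ols[k] / math.sqrt(XtX_inv[k, k])
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(z, p_value)
```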

I don't really know whether the F-test still works, but I am guessing that it doesn't.

How correct/wrong am I?

Best answer:

Let's start with a small-sample argument. In this context, I am assuming that:

  • (H1) The true data-generating process is $Y = X\beta + u$, writing $u$ for the noise $\epsilon$ ($\beta$ may be sparse or not);

  • (H2) $X$ has full column rank (so that $X'X$ is nonsingular);

  • (H3) $E(u \mid X) = 0_{T \times 1}$, where $T$ is the sample size.

Note that (H3) excludes autoregressive terms if you are working with time series data. Now, under these assumptions, it is easy to show that OLS is unbiased, but we're going to look at ridge instead (here $I_K$ is the $K \times K$ identity, matching the dimension of $X'X$): \begin{align} \hat{\beta} &= \left( X'X + \lambda I_K \right)^{-1} X'Y \\ &= \left( X'X + \lambda I_K \right)^{-1} X'X \beta + \left( X'X + \lambda I_K \right)^{-1} X'u \\ E \left( \hat{\beta} \right) &= E \left( \left( X'X + \lambda I_K \right)^{-1} X'X \beta + \left( X'X + \lambda I_K \right)^{-1} X'u \right) \\ &= E \left[ E \left( \left( X'X + \lambda I_K \right)^{-1} X'X \beta + \left( X'X + \lambda I_K \right)^{-1} X'u \bigg| X\right) \right] \\ &= E \left[ \left( X'X + \lambda I_K \right)^{-1} X'X \beta + \left( X'X + \lambda I_K \right)^{-1} E \left( X'u \big| X\right) \right] \\ &= E \left[ \left( X'X + \lambda I_K \right)^{-1} X'X \beta \right] \end{align} I invoked (H1) to get the second line. Then I took expectations on both sides, applied the law of iterated expectations, used the conditioning information to push the expectation all the way inside, and used (H3) to find that the second term is null.

The problem is that for $\lambda \neq 0$, this last term generally isn't $\beta$, so your estimator is not unbiased. Specifically, for $\lambda > 0$, it is actually biased towards zero. That's kind of embarrassing if you're trying to test for $\beta = 0$! ;)
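A quick Monte Carlo illustrates the shrinkage (a sketch with arbitrary numbers; the shrinkage matrix $(X'X + \lambda I)^{-1} X'X$ is the one appearing in the last line of the derivation above):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam, reps = 50, 3, 25.0, 2000
X = rng.normal(size=(n, p))                  # design held fixed across replications
beta = np.ones(p)
shrink = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ X)

est = np.empty((reps, p))
for r in range(reps):
    y = X @ beta + rng.normal(size=n)        # sigma^2 = 1
    est[r] = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(est.mean(axis=0))   # Monte Carlo mean: close to shrink @ beta, not beta
print(shrink @ beta)      # the expected value, pulled towards zero
```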

Now, what if we work asymptotically? \begin{align} \hat{\beta} &= \left( \frac{X'X}{T} + \frac{\lambda}{T} I_K \right)^{-1} \frac{X'X}{T} \beta + \left( \frac{X'X}{T} + \frac{\lambda}{T} I_K \right)^{-1} \frac{X'u}{T} \end{align} If we make assumptions that allow some law of large numbers to apply, all those terms would converge to appropriate covariance matrices and, as long as $\lambda$ is not allowed to grow as fast as the sample size $T$, the $\frac{\lambda}{T} I_K$ terms would collapse to null matrices. Therefore, provided $E(X_t u_t) = 0_{K \times 1}$ (I assumed you have $K$ variables, so this just says nothing in the model is correlated with the error terms), the last term would be null and we'd get $\hat{\beta} \rightarrow \beta$. Asymptotically, your "hint" about known variance is useless because you'd use a convergent estimator of it anyway (that's why the asymptotic laws are normal and $\chi^2$, not Student and Fisher).
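The $\lambda/T \to 0$ argument can be eyeballed numerically (a sketch with invented values: the penalty stays fixed while $T$ grows):

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 100.0                         # fixed penalty, so lam / T -> 0
beta = np.array([1.5, -0.5])

for T in (100, 1000, 10000):
    X = rng.normal(size=(T, 2))
    y = X @ beta + rng.normal(size=T)
    beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
    print(T, beta_hat)              # drifts towards beta as T grows
```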

In that case, you could run statistical tests using an asymptotic justification: $t$ statistics would be asymptotically normal and Wald/likelihood ratio/score statistics would be asymptotically $\chi^2$. However, it's pretty damn clear that it's a bad idea empirically: if $\lambda$ matters at all for estimation, your tests will have sizes that exceed the nominal size, as well as poor power.
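One way to see why the usual null distribution no longer applies, under the assumption $\sigma^2 = 1$ and conditioning on $X$: since $\hat\beta$ is a linear map of $Y$, its exact covariance is the sandwich $(X'X + \lambda I)^{-1} X'X (X'X + \lambda I)^{-1}$, not the $(X'X)^{-1}$ an OLS-style Z-test would plug in, and on top of that the statistic is not centered because of the bias shown earlier. A sketch of the two variance formulas (illustrative numbers):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 80, 4, 30.0
X = rng.normal(size=(n, p))
XtX = X.T @ X
Ainv = np.linalg.inv(XtX + lam * np.eye(p))

# Exact ridge covariance (sigma^2 = 1): beta_hat = Ainv @ X'y is linear
# in y with Cov(y | X) = I, hence the sandwich form below.
cov_ridge = Ainv @ XtX @ Ainv
cov_naive = np.linalg.inv(XtX)          # what an OLS-style test would use

print(np.sqrt(np.diag(cov_ridge)))      # ridge standard errors
print(np.sqrt(np.diag(cov_naive)))      # naive OLS-formula standard errors
```

Note that the naive formula overstates the ridge estimator's variance for every coordinate, so even getting the variance right doesn't rescue the test: the non-centrality from the bias remains.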