Mean Squared Error decreases when more predictors are added


Assume a standard linear model

I) $Y = \beta_1 X_1 + \epsilon$

and the alternative

II) $Y = \beta_1 X_1 + \beta_2 X_2 + \epsilon$,

where in both cases $\epsilon \sim \mathcal{N}(0, \sigma^2)$. That is, we do not model the first linear model as being affected by additional noise due to omitting the predictor $X_2$.

Now assume $X_1$ and $X_2$ are correlated. How can it be that we obtain a lower mean squared error for model II) than for model I)? If $X_1$ and $X_2$ are empirically uncorrelated, we can show that the reverse is true: the trace of the covariance matrix, conditional on the data, of the multiple least squares estimator $(\hat{\beta}_1, \hat{\beta}_2)$ in this case takes the following form:

$$\text{tr}(\text{cov}((\hat{\beta}_1, \hat{\beta}_2))) = \sigma^2 \left( (\mathbf{x}_1^{\top} \mathbf{x}_1)^{-1} + (\mathbf{x}_2^{\top} \mathbf{x}_2)^{-1} \right) \geq \sigma^2 (\mathbf{x}_1^{\top} \mathbf{x}_1)^{-1} = \text{var}(\hat{\beta}_1), $$

where $\mathbf{x}_i$, $i \in \{ 1, 2\}$, denotes the data vector for the corresponding feature and $\mathbf{x}^{\top}_1 \mathbf{x}_2 = 0$.
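The orthogonal case above is easy to check numerically. Here is a minimal sketch (the data vectors $\mathbf{x}_1, \mathbf{x}_2$ are arbitrary orthogonal choices, not from the original post) confirming that $\sigma^2\,\text{tr}((X^{\top}X)^{-1})$ equals the sum $\sigma^2\left((\mathbf{x}_1^{\top}\mathbf{x}_1)^{-1} + (\mathbf{x}_2^{\top}\mathbf{x}_2)^{-1}\right)$ when $\mathbf{x}_1^{\top}\mathbf{x}_2 = 0$:

```python
import numpy as np

# Hypothetical orthogonal data vectors: x1.T @ x2 == 0
x1 = np.array([1.0, 1.0])
x2 = np.array([1.0, -1.0])
X = np.column_stack([x1, x2])
sigma2 = 1.0  # noise variance sigma^2

# Trace of the conditional covariance of (beta1_hat, beta2_hat),
# computed directly from the Gram matrix X.T @ X
trace_direct = sigma2 * np.trace(np.linalg.inv(X.T @ X))

# Closed form from the post, valid for orthogonal columns
trace_formula = sigma2 * (1.0 / (x1 @ x1) + 1.0 / (x2 @ x2))

# Variance of beta1_hat in the single-predictor model I
var_model_I = sigma2 / (x1 @ x1)

print(trace_direct, trace_formula, var_model_I)
```

For these vectors both trace expressions agree, and the trace exceeds $\text{var}(\hat{\beta}_1)$, as the inequality states.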

My question is whether we can show that this inequality can reverse when $\mathbf{x}^{\top}_1 \mathbf{x}_2 \neq 0$. Is it possible to construct a simple toy example where this happens? I find this hard, since the inverse of the Gram matrix $X^{\top}X$ is no longer as simple to compute when $\mathbf{x}^{\top}_1 \mathbf{x}_2 \neq 0$.
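For a $2 \times 2$ Gram matrix the inverse is still available in closed form, so correlated toy examples can be probed directly. A minimal sketch (the correlated vectors below are an arbitrary illustrative choice, not from the post) comparing $\sigma^2\,\text{tr}((X^{\top}X)^{-1})$ against $\text{var}(\hat{\beta}_1) = \sigma^2(\mathbf{x}_1^{\top}\mathbf{x}_1)^{-1}$:

```python
import numpy as np

# Hypothetical correlated data vectors: x1.T @ x2 = 1 != 0
x1 = np.array([1.0, 0.0])
x2 = np.array([1.0, 1.0])
X = np.column_stack([x1, x2])
sigma2 = 1.0  # noise variance sigma^2

# Model II: trace of the conditional covariance of (beta1_hat, beta2_hat).
# For a 2x2 Gram matrix G = [[a, c], [c, b]] this is
# sigma^2 * (a + b) / (a*b - c^2), computed here via the inverse directly.
trace_cov_II = sigma2 * np.trace(np.linalg.inv(X.T @ X))

# Model I: variance of the single least squares estimator beta1_hat
var_model_I = sigma2 / (x1 @ x1)

print(trace_cov_II, var_model_I)  # 3.0 1.0
```

For this particular example the trace in model II is still larger than $\text{var}(\hat{\beta}_1)$; whether any choice of correlated columns can reverse the inequality is exactly the question being asked, and sampling many random $(\mathbf{x}_1, \mathbf{x}_2)$ pairs with this script is one way to search for a counterexample.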