Assume a standard linear model
I) $Y = \beta_1 X_1 + \epsilon$
and the alternative
II) $Y = \beta_1 X_1 + \beta_2 X_2 + \epsilon$,
with $\epsilon \sim \mathcal{N}(0, \sigma^2)$ in both cases. That is, we do not model the first specification as carrying additional noise from the omitted predictor $X_2$.
Now assume $X_1$ and $X_2$ are correlated. How can it be that we obtain a lower mean squared error for model II) than for model I)? If $X_1$ and $X_2$ are empirically uncorrelated, we can show that the reverse is true: the trace of the covariance matrix, conditional on the data, of the least squares estimators $(\hat{\beta}_1, \hat{\beta}_2)$ in that case takes the following form:
$$\text{tr}(\text{cov}((\hat{\beta}_1, \hat{\beta}_2))) = \sigma^2 \left( (\mathbf{x}_1^{\top} \mathbf{x}_1)^{-1} + (\mathbf{x}_2^{\top} \mathbf{x}_2)^{-1} \right) \geq \sigma^2 (\mathbf{x}_1^{\top} \mathbf{x}_1)^{-1} = \text{var}(\hat{\beta}_1), $$
where $\mathbf{x}_i$, $i \in \{ 1, 2\}$, denotes the data vector for the corresponding feature and $\mathbf{x}^{\top}_1 \mathbf{x}_2 = 0$.
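The orthogonal-case identity above is easy to check numerically. Below is a minimal sketch; the vectors $\mathbf{x}_1, \mathbf{x}_2$ are arbitrary choices made to be exactly orthogonal, and $\sigma^2$ is set to 1 purely for illustration:

```python
import numpy as np

# Hypothetical orthogonal data vectors (x1 . x2 = 0 by construction)
x1 = np.array([1.0, 1.0, -1.0, -1.0])
x2 = np.array([1.0, -1.0, 1.0, -1.0])
sigma2 = 1.0  # noise variance, assumed known for this check

X = np.column_stack([x1, x2])
# cov((b1_hat, b2_hat)) = sigma^2 * (X'X)^{-1}; take its trace
trace_cov = sigma2 * np.trace(np.linalg.inv(X.T @ X))
# Closed form for orthogonal columns: sigma^2 * ((x1'x1)^{-1} + (x2'x2)^{-1})
closed_form = sigma2 * (1.0 / (x1 @ x1) + 1.0 / (x2 @ x2))
print(trace_cov, closed_form)  # both equal 0.5 here
```

Since each Gram term $\mathbf{x}_i^{\top}\mathbf{x}_i$ is 4, both expressions evaluate to $\sigma^2 (1/4 + 1/4) = 0.5$, matching the inequality's left-hand side.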
My question is whether this inequality can be reversed when $\mathbf{x}^{\top}_1 \mathbf{x}_2 \neq 0$. Is it possible to construct a simple toy example where this happens? I find this hard because the inverse of the Gram matrix $X^{\top} X$ is no longer as simple to compute when $\mathbf{x}^{\top}_1 \mathbf{x}_2 \neq 0$.
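One way to probe candidate examples is to compute both sides of the inequality numerically instead of inverting $X^{\top} X$ by hand. The sketch below uses one arbitrary pair of strongly correlated vectors (my own choice, not a worked answer to the question) and compares $\operatorname{tr}(\operatorname{cov}(\hat{\beta}_1, \hat{\beta}_2)) = \sigma^2 \operatorname{tr}((X^{\top} X)^{-1})$ against $\sigma^2 (\mathbf{x}_1^{\top} \mathbf{x}_1)^{-1}$:

```python
import numpy as np

# One arbitrary correlated draw, for illustration only
rng = np.random.default_rng(0)
x1 = rng.standard_normal(50)
x2 = 0.9 * x1 + 0.1 * rng.standard_normal(50)  # strongly correlated with x1
sigma2 = 1.0  # noise variance, fixed for illustration

X = np.column_stack([x1, x2])
# Trace of the full-model estimator covariance: sigma^2 * tr((X'X)^{-1})
trace_cov_full = sigma2 * np.trace(np.linalg.inv(X.T @ X))
# Variance of b1_hat in the single-predictor model: sigma^2 / (x1'x1)
var_simple = sigma2 / (x1 @ x1)
print(trace_cov_full, var_simple, trace_cov_full >= var_simple)
```

Running variations of this (different correlations, signs, sample sizes) makes it cheap to search for a configuration where the comparison flips, without any hand computation of the $2 \times 2$ inverse.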