What types of noise allow linear regression interpolation to be unbiased?

In a statistics book I learned that if there are $n$ random variables $X_{t(0)},\dots,X_{t(n-1)}$ (with $t(i)\in\mathbf{R}$) which are independently distributed with distribution $\mathcal{N}(at(i)+b,1)$ (I believe the variance can also vary), then the variable $Y$ obtained by fitting a linear regression to the $X_{t(i)}$ and evaluating the fitted line at time $s$ has mean $as+b$.

They didn't give a proof; is there a reference where I can find one? And if the noise is not normally distributed, does this approach still work?

Best answer

For simplicity I will consider the case where $b = 0$, though the argument extends to include an intercept.

Then, given $(t_i)_{i=1}^n$ and writing $x_i = x_{t(i)}$, the assumption $x_i \sim \mathcal{N}(a t_i, 1)$ can be written as

$$ x_i = a t_i + \xi_i, \qquad \xi_i \sim \mathcal N(0,1).$$

For a regression model without intercept (i.e. $b = 0$), the regression line is determined by the estimate $\hat a$, which is given by the formula

$$ \hat a = \frac{\sum_{i=1}^n x_i t_i}{\sum_{i=1}^nt_i^2}.$$
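(This formula is stated without derivation in the answer; it is the standard least-squares solution. $\hat a$ minimises the residual sum of squares, and setting the derivative to zero recovers it:)

$$ 0 = \frac{d}{da}\sum_{i=1}^n (x_i - a t_i)^2 \Big|_{a = \hat a} = -2 \sum_{i=1}^n t_i (x_i - \hat a\, t_i) \;\Longrightarrow\; \hat a = \frac{\sum_{i=1}^n x_i t_i}{\sum_{i=1}^n t_i^2}.$$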

So we would like to show that this is unbiased, i.e. that $\mathbf E[\hat a] = a$. To see this we plug in the formula for $x_i$ given above

$$ \begin{aligned} \mathbf E[\hat a] & = \frac{1}{\sum_{i=1}^n t_i^2} \sum_{i=1}^n t_i \mathbf E[x_i] \\ & = \frac{1}{\sum_{i=1}^n t_i^2} \sum_{i=1}^n t_i(at_i + \mathbf E[\xi_i] ) \\ & = \frac{1}{\sum_{i=1}^n t_i^2} \sum_{i=1}^n a t_i^2 \\ & = a \end{aligned} $$ which is as required.

Note that in the above we did not require that the errors are normally distributed: in fact, as long as the errors $\xi_i$ have mean $0$, the linear regression above will be unbiased.
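This can be checked numerically. The following sketch (my own illustration, with a hypothetical design $t = (1,\dots,5)$ and slope $a = 2$) uses uniform noise on $[-1, 1]$, which has mean $0$ but is not normal:

```python
import numpy as np

rng = np.random.default_rng(0)
a = 2.0                        # true slope (b = 0, as in the answer)
t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Non-normal, mean-zero noise: uniform on [-1, 1].
trials = 200_000
xi = rng.uniform(-1.0, 1.0, size=(trials, t.size))
x = a * t + xi                 # x_i = a t_i + xi_i

# Least-squares slope without intercept: sum(x_i t_i) / sum(t_i^2)
a_hat = (x @ t) / np.sum(t**2)

print(a_hat.mean())            # close to a = 2.0: unbiased despite non-normal noise
```

Note that the variance of the noise never enters the unbiasedness calculation, only its mean.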

What can change if either

  1. the error is not normal, or
  2. the error is normal but its distribution depends on $t_i$,

is whether the linear regression line remains the best linear unbiased estimate, i.e. the one with lowest variance. This is generally not the case.

As an example (which I do not prove), if we have

$$ x_i = a t_i + \sqrt{t_i} \xi_i, \qquad \xi_i \sim \mathcal N(0,1),$$

then, as before, the linear regression line remains unbiased; however, the alternative estimate

$$ \tilde a = \frac{\sum_i{x_i}}{\sum_i{t_i}}$$

is the best linear unbiased estimate (BLUE); in particular,

$$\text{Var}(\tilde a) \leq \text{Var}(\hat a).$$
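(The answer does not prove this inequality; one way to verify it, assuming all $t_i > 0$, is to compute both variances directly. Since $\text{Var}(x_i) = t_i$ here,

$$ \text{Var}(\hat a) = \frac{\sum_i t_i^2 \, \text{Var}(x_i)}{\left(\sum_i t_i^2\right)^2} = \frac{\sum_i t_i^3}{\left(\sum_i t_i^2\right)^2}, \qquad \text{Var}(\tilde a) = \frac{\sum_i \text{Var}(x_i)}{\left(\sum_i t_i\right)^2} = \frac{\sum_i t_i}{\left(\sum_i t_i\right)^2},$$

and the Cauchy–Schwarz inequality $\left(\sum_i t_i^2\right)^2 = \left(\sum_i t_i^{1/2}\, t_i^{3/2}\right)^2 \leq \left(\sum_i t_i\right)\left(\sum_i t_i^3\right)$ rearranges to $\text{Var}(\tilde a) \leq \text{Var}(\hat a)$.)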

Notes

  1. In the above treatment I have made the simplifying assumption that the $t_i$ are fixed and non-random. This simplifies the detail, but has little overall impact on the answer since we would instead consider the conditional expectation $\mathbf E[\hat a| \underline t]$.
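Conditioning on a fixed design as in this note, the variance comparison in the example can also be checked by simulation. This sketch (my own, with hypothetical values $t = (1,\dots,5)$ and $a = 2$) draws noise scaled by $\sqrt{t_i}$ and compares the two estimates:

```python
import numpy as np

rng = np.random.default_rng(1)
a = 2.0
t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # fixed, non-random design

trials = 200_000
xi = rng.standard_normal((trials, t.size))
x = a * t + np.sqrt(t) * xi               # Var(x_i) = t_i

a_hat = (x @ t) / np.sum(t**2)            # ordinary least-squares estimate
a_tilde = x.sum(axis=1) / t.sum()         # alternative (BLUE) estimate

print(a_hat.mean(), a_tilde.mean())       # both close to a = 2.0: both unbiased
print(a_hat.var(), a_tilde.var())         # Var(a_tilde) < Var(a_hat)
```

Both estimates are unbiased, but the sample variance of $\tilde a$ comes out strictly smaller than that of $\hat a$, matching the inequality above.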