Optimal place to measure for simple linear regression/fitting


Suppose I have a linear model $y_i=a\cdot x_i+b+\epsilon_i$, where $y_i,\epsilon_i,a,b\in\mathbb{R}$ and $x_i\in[-1,1]$. I can take $n$ measurements of $y_i$ at points $x_i$ of my choosing, where $n\in\mathbb{N}$; the $\epsilon_i$ denote measurement errors. I use the $x_i$ and $y_i$ to fit $a$ and $b$ with simple linear regression. My goal is to get the most out of my limited measurements, which essentially means I want the standard errors of the estimators $\hat{a}$ and $\hat{b}$ to be as small as possible, see https://en.wikipedia.org/wiki/Simple_linear_regression#Normality_assumption. (Suppose the normality assumptions hold.) The only thing I can choose is the values of $x_i$. How should I choose them?

My thoughts up to now:

  • Naively, I would place them equispaced in the interval.
  • Looking at Wikipedia's $s_\hat{\beta} = \sqrt{ \frac{\tfrac{1}{n-2}\sum_{i=1}^n \hat{\varepsilon}_i^{\,2}} {\sum_{i=1}^n (x_i -\bar{x})^2} }$, I would make $\sum_{i=1}^n (x_i -\bar{x})^2$ as big as possible, which means $x_i=(-1)^i$. But then I would measure at the same $x_i$ multiple times, which feels wrong.
  • I remember from a lecture (10 years ago) that one preferably chooses $x_i=\cos(\frac{i\pi}{n})$. But I can't find any references to that any more. It turns out my memory was wrong.

Any idea where to investigate or read up?
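The second bullet can be checked numerically: under the normality assumptions, the standard error of $\hat{a}$ scales as $1/\sqrt{\sum_i (x_i-\bar{x})^2}$, so the designs can be compared by that factor alone. A minimal sketch in Python (NumPy), comparing the endpoint design $x_i=(-1)^i$, equispaced points, and the cosine (Chebyshev-style) spacing from the third bullet:

```python
import numpy as np

def slope_se_factor(x):
    """Design-dependent factor of the slope standard error:
    SE(a_hat) = sigma / sqrt(sum_i (x_i - x_bar)^2).
    Smaller is better, so we want sum_i (x_i - x_bar)^2 large."""
    x = np.asarray(x, dtype=float)
    return 1.0 / np.sqrt(np.sum((x - x.mean()) ** 2))

n = 10
designs = {
    "endpoints +-1": np.array([(-1.0) ** i for i in range(n)]),
    "equispaced":    np.linspace(-1.0, 1.0, n),
    "chebyshev":     np.cos(np.arange(n) * np.pi / (n - 1)),
}
for name, x in designs.items():
    print(f"{name:14s} SE factor = {slope_se_factor(x):.4f}")
```

For $n=10$ this gives roughly $0.32$ for the endpoint design, $0.43$ for cosine spacing, and $0.50$ for equispaced points, so the endpoint design does minimize the slope's standard error; the cosine spacing falls in between.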

One answer:

Here's a plot that shows two extremes: make $\sum_i (x_i - \bar{x})^2$

  • as big as possible, $x_i = \pm 1$ (top)
  • as small as possible, all $x_i$ near 0 (bottom).

Lines are fit to $(x_i, y_i)$ over 5 Monte Carlo runs, with $y_i \sim \mathcal{N}(0,1)$ (pure noise).

[Plot: fitted lines from 5 Monte Carlo runs each — top: $x_i = \pm 1$; bottom: all $x_i$ near 0.]

Obviously, the top lines are nearly horizontal, but the lines fit to [all $x_i$ near 0, $y_i$] are nonsense.
Line fits with $x_i$ uniformly spaced, or cosine spaced, fall roughly between these two extremes -- in Monte Carlo. (Who said, "one good theory is worth 1000 computer runs"?)
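The Monte Carlo experiment behind the plot can be sketched as follows (a minimal reconstruction, not the answerer's actual script; true slope and intercept are taken as 0, so $y$ is pure noise, and the "near 0" design is modeled as a tight cluster `linspace(-0.01, 0.01, n)`):

```python
import numpy as np

rng = np.random.default_rng(0)
n, runs, sigma = 10, 2000, 1.0

def mc_slope_std(x):
    """Spread (std) of the OLS slope over Monte Carlo runs where the
    true line is y = 0, i.e. y_i ~ N(0, sigma^2) is pure noise."""
    x = np.asarray(x, dtype=float)
    slopes = []
    for _ in range(runs):
        y = rng.normal(0.0, sigma, size=x.size)
        a_hat, b_hat = np.polyfit(x, y, 1)  # least-squares line fit
        slopes.append(a_hat)
    return np.std(slopes)

x_extreme = np.array([(-1.0) ** i for i in range(n)])  # x_i = +-1 (top)
x_cluster = np.linspace(-0.01, 0.01, n)                # all x_i near 0 (bottom)

print("spread of a_hat, x = +-1  :", mc_slope_std(x_extreme))
print("spread of a_hat, x near 0 :", mc_slope_std(x_cluster))
```

The spread for $x_i = \pm 1$ comes out near the theoretical value $\sigma/\sqrt{n} \approx 0.32$, while the clustered design inflates it by two orders of magnitude, matching the "nonsense" lines in the bottom panel.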

My (amateur) opinion is that the setup is screwy: are there real cases where you're free to choose the $x_i$, but nature adds noise independent of $x$?

People on stats.stackexchange may have better answers. See e.g. understanding-shape-and-calculation-of-confidence-bands-in-linear-regression