Suppose I have a linear model $y_i=a\cdot x_i+b+\epsilon_i$, where $y_i,\epsilon_i,a,b\in\mathbb{R}$ and $x_i\in[-1,1]$. I can take $n$ measurements of $y_i$ at $x_i$, where $n\in\mathbb{N}$; the $\epsilon_i$ denote measurement errors. I use the $x_i$ and $y_i$ to fit $a$ and $b$ with simple linear regression. My goal is to get the most out of my limited measurements, which essentially means I want the standard errors of the estimators $\hat{a}$ and $\hat{b}$ to be as small as possible, see https://en.wikipedia.org/wiki/Simple_linear_regression#Normality_assumption (suppose the normality assumptions hold). But the only thing I can choose is the values of the $x_i$. How should I choose them?
My thoughts up to now:
- Naively, I would place them equispaced in my interval.
- Looking at Wikipedia's $s_\hat{\beta} = \sqrt{ \frac{\tfrac{1}{n-2}\sum_{i=1}^n \hat{\varepsilon}_i^{\,2}} {\sum_{i=1}^n (x_i -\bar{x})^2} }$, I would make $\sum_{i=1}^n (x_i -\bar{x})^2$ as big as possible, which means $x_i=(-1)^i$. But then I would measure at the same $x_i$ multiple times, which feels wrong.
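As a quick sanity check on that denominator, here is a small sketch (assuming $n=10$) comparing $\sum_{i=1}^n (x_i -\bar{x})^2$ for the equispaced design and the endpoint design $x_i=(-1)^i$:

```python
import numpy as np

n = 10
# Equispaced design on [-1, 1]
x_equi = np.linspace(-1, 1, n)
# Endpoint design: half the points at -1, half at +1
x_ends = np.array([(-1.0) ** i for i in range(n)])

def sxx(x):
    """Denominator of the standard error: sum of squared deviations from the mean."""
    return np.sum((x - x.mean()) ** 2)

print(sxx(x_equi))  # ≈ 4.07
print(sxx(x_ends))  # 10.0, the maximum achievable on [-1, 1] with n points
```

So the endpoint design makes the denominator, and hence $s_\hat{\beta}$, as favorable as possible, even though repeating the same $x_i$ feels wrong.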
I remember from a lecture (10 years ago) that you should preferably choose $x_i=\cos(\frac{i\pi}{n})$, but I can't find any references to that any more. It turns out my memory was wrong.
Any idea where to investigate or read up?
Here's a plot that shows two extremes: make $\sum{ (x_i - \bar{x})^2 }$ as large as possible ($x_i = \pm 1$) vs. as small as possible (all $x_i$ near 0).
Lines are fit to $x_i, y_i$ over 5 Monte Carlo runs, with $y_i \sim \mathcal{N}$.
Obviously, the top lines are nearly horizontal, but the lines fit to [all $x_i$ near 0, $y_i$] are nonsense.
Line fits with $x_i$ uniformly spaced, or cos spaced, fall roughly between these two extremes -- in Monte Carlo. (Who said, "one good theory is worth 1000 computer runs" ?)
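A Monte Carlo comparison along these lines can be sketched as follows (assumptions not in the original: true parameters $a=1$, $b=0$, noise $\sigma=1$, $n=10$, and cos spacing taken as $\cos(k\pi/(n-1))$ so that exactly $n$ nodes are generated):

```python
import numpy as np

rng = np.random.default_rng(0)
n, runs = 10, 20000
a, b, sigma = 1.0, 0.0, 1.0  # assumed true slope, intercept, noise level

designs = {
    "endpoints":  np.array([(-1.0) ** i for i in range(n)]),  # half at -1, half at +1
    "equispaced": np.linspace(-1, 1, n),
    "cos-spaced": np.cos(np.arange(n) * np.pi / (n - 1)),     # n cosine-spaced nodes
}

results = {}
for name, x in designs.items():
    # OLS slope for each synthetic data set; np.polyfit returns [slope, intercept]
    slopes = np.array([np.polyfit(x, a * x + b + rng.normal(0.0, sigma, n), 1)[0]
                       for _ in range(runs)])
    results[name] = slopes.std()
    print(f"{name:10s}  empirical SE of slope: {results[name]:.3f}")
```

With these settings the endpoint design gives the smallest empirical standard error of the slope, cos spacing falls in between, and equispaced is worst -- the same ordering the theoretical value $\sigma/\sqrt{\sum_i (x_i-\bar{x})^2}$ predicts.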
My (amateur) opinion is that the setup is screwy: are there real cases where you're free to choose the $x_i$, but nature adds noise independent of $x$ ?
People on stats.stackexchange may have better answers. See e.g. understanding-shape-and-calculation-of-confidence-bands-in-linear-regression