Regression: how does the assumption of independent noise relate to the real noise?


Let us first consider a ground-truth: $$y_i= \theta_0 x_i + e_i,$$ where $\theta_0 \not= 0$ is a constant and $e_i$ is an independent random noise over $i$. Suppose that I can obtain measurements of $y_i$ and $x_i$ over $i$.

Then to estimate the unknown parameter $\theta_0$, we typically consider a linear regression model of the above ground-truth: \begin{equation} y_i= \theta x_i + \epsilon_i \end{equation} where $\theta$ is a to-be-determined parameter and $\epsilon_i$ is assumed to be independent over $i$. Let us say that $\theta \in \Theta \subseteq \mathbb{R}$, which is the range where we search for an estimate $\hat{\theta}$ of $\theta_0$.
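As a concrete illustration of this setup (a minimal numerical sketch with simulated data; the true value $\theta_0 = 2$ and the noise level are arbitrary choices, not given in the question), the least-squares estimate for the no-intercept model recovers $\theta_0$ from measurements of $x_i$ and $y_i$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated ground truth: y_i = theta_0 * x_i + e_i with i.i.d. noise e_i.
theta_0 = 2.0  # hypothetical true parameter
n = 1000
x = rng.uniform(1.0, 3.0, size=n)
e = rng.normal(0.0, 0.5, size=n)
y = theta_0 * x + e

# Least-squares estimate for the no-intercept model y = theta * x:
# theta_hat = sum(x_i * y_i) / sum(x_i^2)
theta_hat = np.sum(x * y) / np.sum(x * x)
print(theta_hat)  # close to theta_0 = 2.0
```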

Now I am a bit confused by the above assumption of independence on $\epsilon_i$ in the regression model.

By this assumption, are we assuming that $\epsilon_i$ is independent over $i$ for any $\theta \in \Theta$?

Or are we assuming only that, if $\theta_0 \in \Theta$, then the choice $\theta = \theta_0$ makes $\epsilon_i = e_i$, which is an independent variable over $i$?

Then what if the model I choose does not include the ground truth (suppose I do not know the right model structure)? For example, consider the model $$y_i = \theta + \epsilon_i,$$ where $\theta$ is a to-be-estimated parameter. Can I still assume that $\epsilon_i$ is independent over $i$ in this model? (This is probably a bad example, but we can well have cases where the ground truth is not in the model set.)
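To make the misspecification concrete (again a sketch with simulated data and a hypothetical $\theta_0 = 2$): if the data come from $y_i = \theta_0 x_i + e_i$ but we fit the intercept-only model $y_i = \theta + \epsilon_i$, the fitted "errors" absorb the omitted $\theta_0 x_i$ term and are visibly not the true noise $e_i$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Data generated from y_i = theta_0 * x_i + e_i, but fitted with the
# misspecified intercept-only model y_i = theta + eps_i.
theta_0 = 2.0  # hypothetical true parameter
n = 1000
x = rng.uniform(1.0, 3.0, size=n)
e = rng.normal(0.0, 0.5, size=n)
y = theta_0 * x + e

# Least squares for y = theta + eps gives theta_hat = mean(y).
theta_hat = y.mean()
residuals = y - theta_hat

# The residuals eps_i = theta_0 * x_i + e_i - theta_hat still contain the
# omitted theta_0 * x_i term, so they are strongly correlated with x:
# the model's "noise" is not the true noise e_i.
corr = np.corrcoef(x, residuals)[0, 1]
print(corr)  # large positive correlation driven by the omitted x term
```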

1 Answer:
I'm not sure what kind of notation you're using, but it differs from the standard one nonetheless.

Typical theory books construct the model as such: $y=\beta_0+\beta_1x+\epsilon$

However, the regression model assumes that the 'noise', or random error $\epsilon$, is independently and normally distributed. This means that the errors of any two different observations are independent; that is, the error associated with $y_i$ is uncorrelated with the error associated with $y_k$ for $i \ne k$.

The value of $\theta$ has no influence on the independence of the errors, and vice versa: there are no values of $\theta$ that make the $\epsilon_i$ dependent.

In practice, it is hard to tell whether the errors are independent; dependence usually arises in time-series data. You can, however, check normality by constructing a stem-and-leaf display or a histogram of the residuals. If the errors do not appear normally distributed, you can apply a transformation to make them closer to normal.
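Alongside a histogram of the residuals, one simple numerical diagnostic (my own suggestion here, not part of the answer above) is the lag-1 sample autocorrelation of the residual sequence, which should be near zero when the errors are independent; again with simulated data and a hypothetical $\theta_0 = 2$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Residuals from a correctly specified fit should look like i.i.d. noise.
theta_0 = 2.0  # hypothetical true parameter
n = 2000
x = rng.uniform(1.0, 3.0, size=n)
e = rng.normal(0.0, 0.5, size=n)
y = theta_0 * x + e

# Fit the no-intercept model and form residuals.
theta_hat = np.sum(x * y) / np.sum(x * x)
residuals = y - theta_hat * x

# Lag-1 sample autocorrelation: near 0 for independent errors, far from 0
# when consecutive errors are dependent (as in many time series).
r = residuals - residuals.mean()
lag1 = np.sum(r[:-1] * r[1:]) / np.sum(r * r)
print(lag1)  # near 0 for independent errors
```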

Lastly, in the model you gave, $y = \theta + \epsilon$, I do not see much purpose in performing regression when no $x$ is given. In fact, there are infinitely many possibilities for $\theta$ if there is no $x$, since any combination with $\theta + \epsilon = y$ fits.