Linear regression with and without error


I'm taking a machine learning course and I'm in the "supervised learning" part that uses linear regression as a statistical tool. What I don't understand is why the linear regression model is used WITHOUT error:

$h_{\theta}(X)=\theta_0+\theta_1X$

instead of the next model:

$h_{\theta}(X)=\theta_0+\theta_1X+\epsilon_i$

The same question applies in the multivariable case.

  • Why is it considered this way?
  • Is it sometimes not necessary to take error into account?
  • When the model is used without error, are the predictions valid?
To understand the reasoning behind why we don't bother with the error term, you need the following mathematical result:

Theorem. The best predictor of $Y$ with respect to mean squared error is $\mathbb{E}[Y \mid X]$.

Proof. See the answer I provided at https://stats.stackexchange.com/a/320442/46427. $\square$

Thus, if we have $Y_i = \theta_0 + \theta_1 X_i + \epsilon_i$, the best estimator of $Y_i$ is $$\mathbb{E}[Y_i \mid X_i] = \mathbb{E}[\theta_0 + \theta_1 X_i+\epsilon_i\mid X_i] = \mathbb{E}[\theta_0 \mid X_i] + X_i\mathbb{E}[\theta_1 \mid X_i] + \mathbb{E}[\epsilon_i \mid X_i] = \theta_0 + \theta_1X_i$$ since we assume $\mathbb{E}[\epsilon_i \mid X_i] = 0$, and $\theta_0, \theta_1$ are constants.
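This is easy to see in simulation: generate data from the full model with an error term, then fit the error-free hypothesis $h_\theta(X) = \theta_0 + \theta_1 X$ by least squares. A minimal sketch, assuming illustrative (hypothetical) values $\theta_0 = 2$, $\theta_1 = 3$:

```python
import numpy as np

# Simulate from the full model Y_i = theta0 + theta1 * X_i + eps_i,
# with hypothetical parameters theta0 = 2, theta1 = 3.
rng = np.random.default_rng(0)
n = 10_000
theta0, theta1 = 2.0, 3.0
x = rng.uniform(0, 10, n)
eps = rng.normal(0, 1, n)          # noise with E[eps_i | X_i] = 0
y = theta0 + theta1 * x + eps

# Fit the error-free hypothesis h_theta(X) = theta0 + theta1 * X
# by least squares: the fitted line estimates E[Y | X].
X = np.column_stack([np.ones(n), x])
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_hat)  # close to the true values [2, 3]
```

The error term lives in the data-generating process, not in the hypothesis: the noise averages out, and the fitted line recovers the conditional mean $\mathbb{E}[Y \mid X]$.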

Do you ever need to take error into account? Sure. Generally speaking, one assumes homoskedasticity in regression, i.e., that all $Y_i$ have the same conditional variance $\text{Var}[Y_i \mid X_i] = \sigma^2 > 0$. But if, for example, you want the variance to differ by observation, you can model a separate error term for each observation, giving each observation its own variance. This requires substantially more background than the question assumes, but mixed-effects models are a good place to start on that concept.