Question about the objective function of linear regression


Suppose we have some data and I am fitting it with a simple linear regression model. As the graph below shows, the black line represents the true model that generated the data. Denote it as $$y = \beta_0 + \beta_1 x_1 + \epsilon,$$ where $\epsilon$ is normally distributed with mean $0$ and variance $\sigma^2$. The red line represents the estimated model, denoted $$\hat y = \hat\beta_0 + \hat\beta_1 x_1.$$ Let the residuals be denoted by $\hat\epsilon_i$. The objective of linear regression is to minimize the sum of squared residuals $\sum_{i=1}^n \hat\epsilon_i^2$, so that we find an estimated line that is close to the true model. However, intuitively, to find an estimated line that is as close as possible to the true line, we just need to minimize the distance between the true line and the estimated line at each observation, which is $|\hat\epsilon_i - \epsilon_i|$ (as the graph below shows). This leads to the new objective function $\min \sum_i (\hat\epsilon_i - \epsilon_i)^2$.
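
For concreteness, here is a minimal Python simulation sketch (the seed, sample size, and true coefficients are arbitrary choices of mine) that fits OLS and evaluates both objectives; the second one is computable here only because the simulation knows $\epsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)

# True model: y = beta0 + beta1*x + eps, eps ~ N(0, sigma^2)
beta0, beta1, sigma = 1.0, 2.0, 1.0
n = 50
x = rng.uniform(0.0, 10.0, n)
eps = rng.normal(0.0, sigma, n)        # known here only because we simulate
y = beta0 + beta1 * x + eps

# OLS: minimize the sum of squared residuals
b1_hat, b0_hat = np.polyfit(x, y, 1)   # polyfit returns [slope, intercept]
resid = y - (b0_hat + b1_hat * x)      # hat-epsilon_i

print("sum hat_eps_i^2        :", np.sum(resid ** 2))
# 'Oracle' objective: the squared vertical gap between the fitted
# and true lines at each x_i, unobservable with real data
print("sum (hat_eps_i-eps_i)^2:", np.sum((resid - eps) ** 2))
```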

What confuses me the most is that the least squares method tries to fit an estimated model that is as close as possible to all the observations, not to the real model. However, an estimated model that is close to the observations is not guaranteed to also be close to the real model, since an observation with a large error term will pull the estimated line away from the true line. For this reason, the objective function $\min \sum_i (\hat\epsilon_i - \epsilon_i)^2$ makes more sense to me, even though in practice we cannot use it as our objective function, since $\epsilon$ is unknown.

My question is then: why do we use $\min \sum_i \hat\epsilon_i^2$ as our objective function if it is not guaranteed to produce a model that is close to the true model? Are $\min \sum_i \hat\epsilon_i^2$ and $\min \sum_i (\hat\epsilon_i - \epsilon_i)^2$ equivalent to each other (or does one lead to the other)?
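
For reference, expanding both definitions from the setup above shows what the infeasible objective actually measures: $$\hat\epsilon_i - \epsilon_i = (y_i - \hat\beta_0 - \hat\beta_1 x_i) - (y_i - \beta_0 - \beta_1 x_i) = (\beta_0 - \hat\beta_0) + (\beta_1 - \hat\beta_1)\,x_i,$$ so $\sum_i (\hat\epsilon_i - \epsilon_i)^2 = \sum_i \bigl[(\beta_0 - \hat\beta_0) + (\beta_1 - \hat\beta_1) x_i\bigr]^2$ depends only on how far the estimated coefficients are from the true ones. The two objectives are therefore not the same function of the data; one connection I am aware of is the Gauss–Markov theorem, under which OLS minimizes the expected value of this quantity among linear unbiased estimators.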

Any help would be appreciated. Thanks in advance.

[Figure: data points with the true line (black) and the estimated line (red)]


2 Answers


Comments:

(a) In some applications one minimizes $D = \sum_i |\hat e_i|$ instead of $Q = \sum_i \hat {e_i}^2.$ An advantage of $D$ (my notation) is that it puts less emphasis on points far from the fitted line. (But one usually pays due attention to points far from the usual line made using $Q;$ this is part of 'regression diagnostics'.) Advantages of using $Q$ are computational simplicity and the existence of standard distributions to use in testing and making CIs.
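
To illustrate the contrast between $D$ and $Q$, here is a rough Python sketch (the data and the single outlier are made up for illustration) fitting both criteria to the same sample; $D$ has no closed form, so it is minimized numerically:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, 40)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, 40)
y[0] += 15.0                            # one gross outlier

# Q: least squares, closed form
slope_q, intercept_q = np.polyfit(x, y, 1)

# D: least absolute deviations, minimized numerically
def lad_loss(params):
    b0, b1 = params
    return np.abs(y - b0 - b1 * x).sum()

res = minimize(lad_loss, x0=[intercept_q, slope_q], method="Nelder-Mead")
print("Q fit:", intercept_q, slope_q)   # pulled toward the outlier
print("D fit:", res.x[0], res.x[1])     # much less affected by it
```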

(b) As mentioned by @littleO, expressions involving $\epsilon_i$ are off the table because the $\epsilon_i$ are not known.

(c) As for general 'objectives' of regression, I immediately think of two: First, prediction of $y$-values from new $x$-values (not used in making the line). Second, understanding relationships among variables: either to verify known theoretical relationships as holding true in practice or to discover new relationships.

Note: Recent hand surgery has reduced me to hunt-and-peck typing for a few days, and probably to making even more typos than usual. If you have rep to fix them, please feel free.


Another way to fit the data is total least squares. It considers the error not only in the function ($y$) direction but also in the variable ($x$) direction: it minimizes the shortest squared distance from each point to the line, $\epsilon_x^2 + \epsilon_y^2$, instead of just the vertical distance.
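
For a concrete feel, here is a minimal Python sketch (data made up) of total least squares for a line via the SVD; the fitted line passes through the centroid of the points along their first principal component, which minimizes the sum of squared perpendicular distances:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, 30)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, 30)

# Center the data; the leading right-singular vector gives the
# direction minimizing the sum of eps_x^2 + eps_y^2.
A = np.column_stack([x - x.mean(), y - y.mean()])
_, _, Vt = np.linalg.svd(A, full_matrices=False)
dx, dy = Vt[0]                          # direction of largest variance
slope = dy / dx
intercept = y.mean() - slope * x.mean()
print(f"TLS line: y = {intercept:.3f} + {slope:.3f} x")
```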

Here is a picture showing the total least squares distances (by Netheril96, from Wikipedia):

[Figure: total least squares distances, measured perpendicular to the fitted line]