calculating least squares fit


I read this thread discussing why we use least squares for curve fitting:

Why do we use a Least Squares fit?

One answer by Chris Taylor begins with the assumption that we should look for

$$ y_i=ax_i+e_i $$

This reference:

http://www.bradthiessen.com/html5/docs/ols.pdf

also supports Chris' choice and states that "We could measure distance from the points to the line horizontally, perpendicularly, or vertically. Since we oftentimes use regression to predict values of Y from observed values of X, we choose to measure the distance vertically."

But would it not be better to measure the perpendicular distance? For example, if we assume that our fitted line is as above, then the squared distance $\Delta^2$ from a predicted point $(x_i, y_i)$ on the line to an actual data point $(x_0, y_0)$ would be $\Delta^2 = (x_i - x_0)^2 + (y_i - y_0)^2$,

so the error, $E$, would be $$ E = \sum_{i=1}^n \Delta_i^2 $$ This is where I get a little confused. I know that

$$ y_i=ax_i+e_i $$ $$ x_i=(y_i-e_i)/a $$

so substituting for $(\Delta_i)^2$

$$ E = \sum_{i=1}^n ((\frac{y_i-e_i}{a}) - x_0)^2 + ((ax_i+e_i) - y_0)^2$$

Then I differentiate with respect to $a$ and set that derivative equal to $0$? Not only am I getting lost here, but my LaTeX skills are failing. The equation gets pretty complicated, but it can be calculated. The question is: is it more accurate to use the perpendicular distance rather than the vertical distance?

There are 5 answers below.

Best answer:

The reason for using vertical distance is that often you have a knob you turn that controls the important parameter of the experiment and then you measure the output. We believe you can set this parameter exactly (or so close that any error of the point is all in the measurement of the $y$ value, not the $x$ value). This is appropriate as long as the error in $x$ is small compared to the error in $y$ divided by $\frac {dy}{dx}$. This is often true, but not always.

Another answer:
  1. Using the "nearest point on the line" involves dropping a perpendicular from the point to the line. It is more complicated than the expression in the OP.

  2. There is a good reason, apart from simplicity, to use the vertical distance. In many circumstances the data points are not just arbitrarily scattered points in the plane. The $x$-coordinate represents known quantities, while the $y$-coordinate represents measured data (which may contain variability and error). For example, we measure the population of a city every year: the $x$-axis measures years, and the vertical distance to the best-fit line models the deviation of the population, in each given year, from the model's prediction. If instead we used perpendicular distances as proposed, there would be no such natural interpretation of what the fit is modeling.

Another answer:

The $x$-variable may be in kilograms and the $y$-variable in dollars. If you change kilograms to grams or to metric tons, or change dollars to cents, then the lines that were perpendicular in the plane before no longer are. No such problem afflicts conventional least squares.
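This unit-dependence is easy to check numerically. The sketch below (pure Python, with made-up kilogram/dollar data) fits a line through the origin in two ways: ordinary least squares on vertical distances, and a perpendicular-distance fit found by ternary search on $\Phi(a)=\sum_i (y_i-a x_i)^2/(1+a^2)$, assuming the minimizer lies in the given bracket. Converting kilograms to grams leaves the vertical-distance line unchanged, but gives a genuinely different perpendicular-distance line.

```python
def ols_slope(xs, ys):
    # Vertical-distance least squares through the origin: a = sum(xy) / sum(x^2).
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def perp_slope(xs, ys, lo=0.0, hi=10.0, iters=100):
    # Perpendicular-distance fit through the origin, found numerically by
    # ternary search on Phi(a) = sum (y - a x)^2 / (1 + a^2).
    # Assumes the minimizer lies in [lo, hi] and Phi is unimodal there.
    def phi(a):
        return sum((y - a * x) ** 2 for x, y in zip(xs, ys)) / (1 + a * a)
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if phi(m1) < phi(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

kg  = [1.0, 2.0, 3.0, 4.0]        # hypothetical weights in kilograms
usd = [2.1, 3.9, 6.2, 7.8]        # hypothetical prices in dollars
g   = [1000.0 * x for x in kg]    # the same weights expressed in grams

# Vertical-distance fit: switching kg -> g rescales the slope by exactly 1/1000,
# i.e. it is the same line in different units.
a_ols_kg, a_ols_g = ols_slope(kg, usd), ols_slope(g, usd)
# Perpendicular-distance fit: the 'gram' line is a different line.
a_perp_kg = perp_slope(kg, usd)
a_perp_g  = perp_slope(g, usd, hi=1.0)
```

The vertical-distance slope in grams is exactly the kilogram slope divided by 1000; the perpendicular-distance slope is not, which is the unit-dependence described above.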

And suppose you classify people according to occupation (any of six types of jobs, say) and native language, in a community in which there are ten of those. And you have a model that predicts income based on these. It says $$ y_{ijk} = \alpha+\beta_i+\gamma_j + \varepsilon_{ijk} $$ where $i$ is any of the six occupational classifications, $j$ is any of the native languages, and $\varepsilon_{ijk}$ appears in the $k$th observation among all for whom the values of the classifications are $i,j$. You estimate $\alpha$, $\beta_i$, $\gamma_j$ from a sample, using least squares. How do you measure "perpendicular distances" then?
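The question is rhetorical, but it shows how naturally vertical-residual least squares extends to this setting. A sketch in Python, under assumptions not in the answer: made-up numbers, a balanced design, and the usual sum-to-zero constraints $\sum_i \beta_i = \sum_j \gamma_j = 0$, for which the least squares estimates reduce to the grand mean and row/column deviations from it. The residuals are plain vertical residuals; no geometry is involved.

```python
# Least squares for the additive model y_ijk = alpha + beta_i + gamma_j + eps_ijk.
# Hypothetical balanced data: two occupations x two languages, two people per cell.
data = {  # (occupation i, language j) -> observed incomes
    (0, 0): [30.0, 32.0], (0, 1): [40.0, 41.0],
    (1, 0): [35.0, 33.0], (1, 1): [44.0, 46.0],
}

all_y = [y for cell in data.values() for y in cell]
grand = sum(all_y) / len(all_y)

def mean_where(cond):
    # Mean of all observations whose cell key satisfies cond.
    vals = [y for key, cell in data.items() if cond(key) for y in cell]
    return sum(vals) / len(vals)

# With sum-to-zero constraints and a balanced design, the least squares
# estimates are the grand mean plus row/column deviations from it.
alpha = grand
beta  = {i: mean_where(lambda k, i=i: k[0] == i) - grand for i in (0, 1)}
gamma = {j: mean_where(lambda k, j=j: k[1] == j) - grand for j in (0, 1)}

def sse(a, b, g):
    # Sum of squared vertical residuals: no perpendicular distance needed.
    return sum((y - (a + b[i] + g[j])) ** 2
               for (i, j), cell in data.items() for y in cell)
```

Perturbing any estimated parameter increases the residual sum of squares, confirming these are the least squares values for this (hypothetical) data.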

Another answer:

Remember that regression is used for modelling, and thus it should not be reduced to a question of geometry.

Given $x$, we want to make a guess $f(x)$ of what $Y$ will be. In our case the guess will be of the form $f(x)=ax+b$.

If we have observations $(x_1,y_1),\ldots,(x_n,y_n)$, we want to measure how well our guess $ax+b$ fits the data. A natural measure is the distance $r_i$ between what really happened, $y_i$, and what we guessed it to be, $f(x_i)$; that is, $r_i=\vert y_i-(ax_i+b)\vert$. To get one measure for all observations we sum all the differences (residuals) into

$$F^*(a,b)=\sum_{i=1}^n r_i=\sum_{i=1}^n \vert y_i-ax_i-b\vert $$ which we then want to minimize in terms of $a,b$.

However, for various reasons (not least that squares are differentiable where absolute values are not), we choose another intuitive "measure of a good guess", $s_i=(y_i-ax_i-b)^2$, and minimize $$F(a,b)=\sum_{i=1}^n s_i$$
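Minimizing $F(a,b)$ gives closed-form estimates: setting $\partial F/\partial a = \partial F/\partial b = 0$ yields the normal equations. A minimal sketch (Python, made-up data):

```python
def fit_line(xs, ys):
    # Minimize F(a, b) = sum_i (y_i - a*x_i - b)^2.
    # The normal equations dF/da = dF/db = 0 solve to:
    #   a = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2),  b = (Sy - a*Sx) / n
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Points lying exactly on y = 2x + 1 should be recovered exactly.
a, b = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```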

Another answer:

As said in previous answers, the problem depends on what is measured. Suppose that $x$ is known without error and $y$ is measured with some error. Then the vertical distance is the right choice.

Of course, the problem can be different: suppose that for a known value of $x$ you measure $y$ (with some error) and also $z$ (with some error too) and, for whatever reason, you want to correlate $y$ with $z$. Now both the independent and dependent variables are in error, and the vertical distance is probably no longer the right choice. Moreover, you will need to know somehow the standard deviations associated with the $y_i$'s and the $z_i$'s, and this is what often makes the problem difficult.

If you google "orthogonal distance regression" or "total least squares", you will find some interesting papers and discussions on this topic.

For the fit of a straight line through the origin $(y=a\, x)$, the problem is not too bad. Consider a data point $(X_i,Y_i)$. The foot of the perpendicular lies on the normal to $y=a \,x$ through this point; writing the normal as $$y=-\frac 1 a x+b$$ the condition $$Y_i=-\frac 1 a X_i+b$$ gives the value of $b$. The two lines then intersect at $$x_0=\frac{X_i+aY_i}{1+a^2} \ \ ,\ \ y_0=a\frac{X_i+aY_i}{1+a^2}$$ so the square of the distance from the point $(X_i,Y_i)$ to the point $(x_0,y_0)$ is just $$d_i^2=\frac{(Y_i-a X_i)^2}{a^2+1}$$ and the problem is then to minimize $$\Phi(a)=\sum_{i=1}^n \frac{(Y_i-a X_i)^2}{a^2+1}$$

As usual, we take the derivative with respect to $a$, and after some algebra we arrive at a quadratic equation in $a$: $$\Big(\sum_{i=1}^n X_iY_i\Big)a^2+\Big(\sum_{i=1}^n (X_i^2-Y_i^2)\Big)a-\Big(\sum_{i=1}^n X_iY_i\Big)=0$$ which is certainly more complex than $$a=\frac{\sum_{i=1}^n X_iY_i}{\sum_{i=1}^n X_i^2}$$ given by the regression based on vertical distances.
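The through-origin quadratic is easy to implement and check. A minimal sketch (Python, made-up data roughly following $y=2x$): it computes both slopes and verifies numerically that the chosen root of the quadratic really minimizes $\Phi$.

```python
import math

def slope_vertical(X, Y):
    # a = (sum X_i Y_i) / (sum X_i^2): vertical distances, line through origin.
    return sum(x * y for x, y in zip(X, Y)) / sum(x * x for x in X)

def slope_perpendicular(X, Y):
    # Root of (sum XY) a^2 + (sum (X^2 - Y^2)) a - (sum XY) = 0.
    # For positively correlated data, the '+' root is the minimizer of Phi.
    A = sum(x * y for x, y in zip(X, Y))
    B = sum(x * x - y * y for x, y in zip(X, Y))
    return (-B + math.sqrt(B * B + 4 * A * A)) / (2 * A)

def Phi(a, X, Y):
    # Sum of squared perpendicular distances to the line y = a x.
    return sum((y - a * x) ** 2 for x, y in zip(X, Y)) / (a * a + 1)

X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [1.8, 4.1, 6.2, 7.9, 10.3]   # hypothetical noisy data, roughly y = 2x
a_perp = slope_perpendicular(X, Y)
a_vert = slope_vertical(X, Y)
```

On noisy data the two slopes are close but not equal, matching the discussion above; on data lying exactly on a line through the origin, both recover the same slope.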

For the fit of a straight line with an intercept, the problem is already much more complex and the expressions for the parameters are not explicit.