Why does regression use least "squares" instead of least "absolute values"?


Linear regression minimizes the sum of squared residuals to find the best fit. Why? I fully understand that we do not want to sum the raw residuals, since positive and negative values would cancel each other out. But then why don't we use absolute values? Sorry if this sounds like a duplicate question. I did see many explanations but did not see an easy-to-understand answer. For example, some said that squares make the calculation easier. How come?

Your insight is highly appreciated!

There are 6 best solutions below

Answer 1 (3 votes)

$$\min_{a,b}\sum_{k=1}^n(ax_k+b-y_k)^2$$ has a simple analytical solution.

$$\min_{a,b}\sum_{k=1}^n|ax_k+b-y_k|$$ has no closed-form solution and must be solved numerically (for example, as a linear program).

One of the reasons is that the absolute value function is not differentiable at zero.
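
For reference, the simple analytical solution mentioned above is the familiar pair (a standard fact, stated here for completeness)

$$a=\frac{\sum_{k=1}^n (x_k-\bar x)(y_k-\bar y)}{\sum_{k=1}^n (x_k-\bar x)^2},\qquad b=\bar y-a\,\bar x,$$

where $\bar x$ and $\bar y$ denote the sample means. No comparably simple formula exists for the absolute-value problem.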

Answer 2 (1 vote)

It is easy to minimize the error when it is given by least squares. Consider the following: we are given points $(x_k,y_k),\ k=1,\ldots,n$, and we want to find constants $a,b$ such that $y \approx ax+b$. What does $y\approx ax+b$ mean? For example, that $E(a,b):=\sum_{k=1}^n (y_k-ax_k-b)^2$ is minimal in $a,b$. Now
\begin{align*}
\frac{\partial}{\partial a} E(a,b) &= -2\sum_{k=1}^n (y_k-ax_k-b)x_k = 0,\\
\frac{\partial}{\partial b} E(a,b) &= -2\sum_{k=1}^n (y_k-ax_k-b) = 0.
\end{align*}
The solution is given by the linear system
$$ \begin{bmatrix}1 & \frac1n\sum_{k=1}^n x_k \\ \frac1n\sum_{k=1}^n x_k & \frac1n\sum_{k=1}^n x_k^2 \end{bmatrix}\begin{bmatrix} b \\ a \end{bmatrix} = \begin{bmatrix} \frac1n\sum_{k=1}^n y_k \\ \frac1n\sum_{k=1}^n x_ky_k \end{bmatrix}, $$
and it can be shown that this is indeed a minimum by looking at the Hessian of $E(a,b)$.
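
As a minimal sketch of how directly this system can be solved (using NumPy; the example data are invented):

```python
import numpy as np

def fit_line(x, y):
    """Solve the 2x2 normal equations above for y ≈ a*x + b."""
    M = np.array([[1.0,      x.mean()],
                  [x.mean(), (x**2).mean()]])
    rhs = np.array([y.mean(), (x * y).mean()])
    b, a = np.linalg.solve(M, rhs)  # unknowns ordered [b, a] to match the system above
    return a, b

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(fit_line(x, y))  # slope ≈ 1.94, intercept ≈ 1.09
```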

Answer 3 (1 vote)

In actuality, least-absolute-deviations regression is sometimes used, but there are a few reasons why least squares is more popular.

1) In calculus, when solving an optimization problem (which is what regression is: minimizing error) we take the derivative and find the points where it equals 0. Absolute value signs are a nightmare to differentiate and produce a piecewise function, whereas squares are far simpler: the derivative of the squared loss is linear in the parameters, so setting it to zero yields a linear system.

2) Least-squares regression lines are more statistically efficient when the errors are roughly normal: they do not require as many samples to get a good estimate of the true regression line for the population.

But in all honesty, least squares is more common largely because it ended up that way. There are good arguments that in many scenarios least absolute deviations is better, including the fact that least-squares regression is far more sensitive to outliers.

This sensitivity is shown in the following example, sourced from: https://demonstrations.wolfram.com/ComparingLeastSquaresFitAndLeastAbsoluteDeviationsFit/
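
In the same spirit, here is a small numerical sketch (NumPy/SciPy; the data are invented) fitting the same noisy line with one large outlier under both losses:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, x.size)
y[-1] += 30.0  # a single large outlier

# Least squares: closed form (np.polyfit solves the normal equations)
a2, b2 = np.polyfit(x, y, 1)

# Least absolute deviations: no closed form, so minimize numerically
l1_loss = lambda p: np.sum(np.abs(y - (p[0] * x + p[1])))
a1, b1 = minimize(l1_loss, x0=[a2, b2], method="Nelder-Mead").x

# The L2 fit is dragged toward the outlier; the L1 fit barely moves
print(f"least squares:             slope={a2:.2f}, intercept={b2:.2f}")
print(f"least absolute deviations: slope={a1:.2f}, intercept={b1:.2f}")
```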

Answer 4 (3 votes)

As mentioned by others, the least-squares problem is much easier to solve. But there's another important reason: assuming i.i.d. Gaussian noise, the least-squares solution is the maximum-likelihood estimate.
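
To spell the argument out: if $y_k = a x_k + b + \varepsilon_k$ with i.i.d. $\varepsilon_k \sim \mathcal N(0,\sigma^2)$, the log-likelihood is

$$\log L(a,b) = -\frac n2 \log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{k=1}^n (y_k - a x_k - b)^2,$$

so maximizing the likelihood over $a,b$ is exactly minimizing the sum of squared residuals. (By the same argument, i.i.d. Laplace noise would make least absolute deviations the maximum-likelihood estimate.)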

Answer 5 (0 votes)

In addition to the previous answers, I want to highlight the differences in the solutions obtained when optimizing each of the two objective functions. In particular, if we look at the response variable $y$ conditioned on the explanatory variables $\mathbf{x}$, that is $y | \mathbf{x}$, the algorithm estimates

  • the mean of the response values, in the case of squared differences;
  • the median of the response values, in the case of absolute differences (see the sketch after this list).
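
Here is a minimal sketch of that fact in the simplest setting, fitting a constant to a sample (NumPy/SciPy; the data are invented for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # skewed sample with one extreme value

# The constant minimizing the sum of squared errors is the mean...
c_sq = minimize_scalar(lambda c: np.sum((data - c) ** 2)).x
# ...while the constant minimizing the sum of absolute errors is the median.
c_abs = minimize_scalar(lambda c: np.sum(np.abs(data - c))).x

print(c_sq, data.mean())       # both ≈ 22.0
print(c_abs, np.median(data))  # both ≈ 3.0
```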

By replacing the absolute value with a tilted ("pinball") loss, $\rho_\tau(r) = r\,(\tau - \mathbb{1}\{r < 0\})$, we obtain quantile regression. The figures below exemplify the differences in solutions for the two methods (these images were taken from this assignment, see §2):

[Figures omitted: fitted lines under the squared loss vs. the absolute/tilted loss]

The same resource provides some motivating examples for using quantile regression:

  • A device manufacturer may wish to know the 10% and 90% quantiles for some feature of the production process, so as to tailor the process to cover 80% of the devices produced.
  • For risk management and regulatory reporting purposes, a bank may need to estimate a lower bound on the changes in the value of its portfolio which will hold with high probability.

Answer 6 (0 votes)

One can think of a set of $n$ observations as an $n$-dimensional vector. We then have the Euclidean norm $\sqrt {\sum (y_i-\hat y_i)^2}$. Since minimizing the square root of a quantity is the same as minimizing the quantity itself (for nonnegative numbers), it is simpler to speak of least squares rather than of the least root of the sum of squares.

Using $\sum (y_i-\hat y_i)^2$ rather than $\sqrt {\sum (y_i-\hat y_i)^2}$ has further advantages. For instance, because $\hat y$ is an orthogonal projection, Pythagoras gives the decomposition $\sum y_i^2 = \sum \hat y_i^2 + \sum (y_i-\hat y_i)^2$, splitting the total sum of squares into an "explained" part $\sum \hat y_i^2$ and an "unexplained" (residual) part $\sum (y_i-\hat y_i)^2$.

Once we have the Euclidean norm, many questions can be answered by looking at the geometry of the space. For instance, the set of vectors of the form $\hat y = mx+b$ (as $m$ and $b$ vary) is a plane in that space. Finding the least-squares fit means finding the point on this plane closest to the observation vector, which is obtained by orthogonally projecting the observation vector onto the plane, a simple linear algebra problem.
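
A small sketch of this projection view (NumPy; the data here are invented, and np.linalg.lstsq solves the associated normal equations):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 20)
y = 3.0 * x - 2.0 + rng.normal(0.0, 0.5, x.size)

# Columns of A span the "plane" of all vectors m*x + b*1 in R^n
A = np.column_stack([x, np.ones_like(x)])

# Least squares = orthogonal projection of y onto that plane
(m, b), *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ np.array([m, b])

# The residual vector is perpendicular to the plane...
print(np.allclose((y - y_hat) @ A, np.zeros(2)))  # True
# ...so Pythagoras gives the sum-of-squares decomposition above
print(np.isclose(y @ y, y_hat @ y_hat + (y - y_hat) @ (y - y_hat)))  # True
```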