My question is about the linear fit and the least squares method.
Why do we choose to minimize the quantity
$$ S = \sum_{i=1}^n r_i^2 $$
rather than the sum of the residuals themselves,
$$ \sum_{i=1}^n r_i, \qquad \text{where } r_i = y_i - f(x_i, \beta)? $$
Let's think about the least squares fit intuitively: I want to find a straight line that approximates the distribution of some points in the $xy$ plane. The first thing that comes to mind is to find a line whose distance $r_i$ from each point is as small as possible, not necessarily the square of that distance. Why this choice?
Using the sum of the signed deviations doesn't really make sense, because then you can make the intercept $b$ of the line $y = ax + b$ arbitrarily large and positive: the line floats above all the points, every signed deviation becomes very negative, and the "objective" decreases without bound, so no minimizer exists.
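To see the degeneracy concretely, here is a minimal sketch (the data points and the line $y = ax + b$ are made up for illustration): raising the intercept drives the sum of signed residuals toward $-\infty$.

```python
import numpy as np

# Illustrative data points; the model is a line y = a*x + b.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 1.1, 1.9, 3.2])

def signed_sum(a, b):
    """Sum of signed residuals r_i = y_i - (a*x_i + b)."""
    return np.sum(y - (a * x + b))

# Raising the intercept b makes the "objective" arbitrarily negative,
# so minimizing it has no solution: the line just floats upward forever.
for b in (0.0, 10.0, 1000.0):
    print(b, signed_sum(1.0, b))
```

Each unit increase in $b$ lowers the sum by exactly $n$, so there is no minimum to find.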
Minimizing the absolute deviations, i.e. minimizing $\sum_{i=1}^n |y_i-f(x_i,\beta)|$, would make a certain degree of sense.
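As a sketch of what an absolute-deviations fit looks like in practice (the data, including the outlier, are made up for illustration), the following uses the known property that a least-absolute-deviations line can always be chosen to pass through two of the data points, so a brute-force search over pairs finds an optimum:

```python
import numpy as np
from itertools import combinations

# Illustrative data: four collinear points plus one outlier.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 1.0, 2.0, 3.0, 10.0])

def l1_loss(a, b):
    """Sum of absolute deviations for the line y = a*x + b."""
    return np.sum(np.abs(y - (a * x + b)))

# An L1-optimal line can always be chosen to pass through two data
# points, so brute-forcing over all pairs finds an optimum.
candidates = []
for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
    if xi == xj:
        continue
    a = (yj - yi) / (xj - xi)
    candidates.append((a, yi - a * xi))
a_l1, b_l1 = min(candidates, key=lambda ab: l1_loss(*ab))

# Ordinary least squares for comparison (degree-1 polynomial fit).
a_ls, b_ls = np.polyfit(x, y, 1)

print("L1 fit:", a_l1, b_l1)   # follows the four collinear points
print("LS fit:", a_ls, b_ls)   # pulled toward the outlier
```

The contrast also shows a real difference between the two objectives: the L1 fit ignores the outlier, while the least squares fit is pulled toward it.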
The most straightforward reason to favor least squares instead is the Gauss-Markov theorem, which says (in the simplest case) that if the $y_i$ are random additive perturbations of $f(x_i,\beta)$ where the perturbations have mean zero, are uncorrelated, and have the same finite variance, then least squares provides the best linear unbiased estimator for the parameters. Some of these assumptions can also be relaxed by considering weighted versions of least squares.
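A small simulation can illustrate the Gauss-Markov guarantee (the true line, the noise level, and the competing "endpoint" estimator below are all chosen for illustration): both the OLS slope and the slope through the first and last points are linear unbiased estimators under the stated noise model, but the OLS slope has the smaller variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative true line y = a*x + b and design points.
a_true, b_true = 2.0, 1.0
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([x, np.ones_like(x)])

ols_slopes, endpoint_slopes = [], []
for _ in range(5000):
    # Mean-zero, uncorrelated, equal-variance noise (Gauss-Markov setting).
    y = a_true * x + b_true + rng.normal(0.0, 0.5, size=x.size)
    # Least squares estimate of (a, b).
    a_ols, _ = np.linalg.lstsq(X, y, rcond=None)[0]
    # A competing *linear unbiased* estimator: slope through the endpoints.
    a_end = (y[-1] - y[0]) / (x[-1] - x[0])
    ols_slopes.append(a_ols)
    endpoint_slopes.append(a_end)

# Both estimators average to the true slope 2.0, but Gauss-Markov
# guarantees the OLS slope has the smaller variance.
print(np.mean(ols_slopes), np.var(ols_slopes))
print(np.mean(endpoint_slopes), np.var(endpoint_slopes))
```

Any other linear unbiased estimator you substitute for the endpoint slope will show the same pattern: unbiased, but at least as variable as OLS.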
But if you just have data that you want to fit (to interpolate, perhaps) and aren't thinking in terms of a noise model, then a different metric might work out. Some advantages of least squares from this point of view include: