Is minimizing the squared errors optimal?


In least squares regression, we try to minimize the sum of the squared error terms. I was wondering if this unfairly penalizes a model for having points that are too far away. For example, a point that is $1$ unit away from the line of best fit contributes $1^2=1$ to the sum of squared errors, but a point that is $2$ units away contributes $2^2=4$.

So the point that is 2 units away has 4 times as large an influence on the "performance" of a linear model as the first point.

I was wondering why this is the case, and whether there's some intuitive explanation for it. Is the square used because cubics, quartics, etc., are harder to work with? I'd imagine it's easier to optimize a square than an absolute value. Is there some fundamental reason for this, or is it convention?

If anybody would have an intuitive explanation, that'd be perfect. Thanks so much!


Best answer:

This is one of those things for which it is hard to give a single answer. At the end of the day there are indeed modeling drawbacks, but there are so many theoretical advantages that squared error is rather hard to resist.

I think the best point I can make is the one I will make first: absolute differences (probably the most obvious idea to consider) have a serious problem. Their minimizer is frequently not unique. For instance, if we just look at "constant trendlines" (so we force the slope to be zero), a data set with an even number of data points, all distinct, has a whole continuum of horizontal lines that minimize the absolute vertical difference. These are precisely the medians, and they consist of all numbers between the $(N/2)$th data point and the $(N/2+1)$th data point.
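The non-uniqueness above is easy to check numerically. Here is a minimal sketch with a made-up four-point data set: every constant trendline between the two middle data points attains exactly the same sum of absolute deviations.

```python
# Sketch: for an even number of distinct points, any constant between the
# (N/2)th and (N/2+1)th order statistics minimizes the sum of absolute
# deviations -- the L1 minimizer is a whole interval, not a single value.
data = [1.0, 2.0, 10.0, 11.0]  # N = 4; the two middle points are 2 and 10

def sum_abs_dev(c, xs):
    """Sum of absolute vertical distances from the constant line y = c."""
    return sum(abs(x - c) for x in xs)

# every candidate in [2, 10] attains the same minimal loss of 18
losses = [sum_abs_dev(c, data) for c in (2.0, 4.0, 6.0, 8.0, 10.0)]
```

Stepping outside the interval (say $c = 1$) strictly increases the loss, which confirms that the flat region is exactly the set of medians.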

This non-uniqueness effect is closely tied to a sensitivity effect: adding a data point frequently changes the character of the minimizer drastically. For example, if you begin with an even number of data points and your $(N/2)$th data point and $(N/2+1)$th data points are far apart, then you might say "OK, I'll just pick the middle point between them as my median and be done with it". (After all, this is what we were taught in pre-algebra or whatever, and it makes a lot of intuitive sense.) If you now add a data point which is not between these two points, the median is suddenly one of those two points, which is very far away from your previous median, even if $N$ is very large. It's not far away from the whole set of previous medians, but we don't want to have to think in terms of a whole set of medians. (On the flip side, for medians, this is the worst thing that can happen. For least squares, a sufficiently severe outlier can have an even more dramatic effect.)
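The sensitivity effect can also be seen in a few lines. In this sketch (toy data, hypothetical numbers), the "midpoint" median of an even-sized set jumps a long way when a single point is appended outside the middle interval.

```python
# Sketch: adding one data point can move the "midpoint" median drastically
# when the two middle points are far apart.
def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    # even count: midpoint of the two middle values; odd count: middle value
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

before = median([1.0, 2.0, 10.0, 11.0])        # midpoint of 2 and 10
after = median([1.0, 2.0, 10.0, 11.0, 20.0])   # odd count: one middle point
```

Here `before` is 6.0 but `after` is 10.0: one extra observation moved the reported median by 4 units, even though it landed far from the middle of the data.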

Linear least squares has neither of those problems: under a simple "non-degeneracy" assumption, linear least squares has a unique solution, and the sensitivity of this solution to the addition of a new data point decays with the size of the data set. This is often also true of nonlinear least squares.

Other advantages:

  • The actual problem of linear least squares is a pure linear algebra problem: it can be written in terms of the normal equations $A^T Ax=A^T b$, or the QR formulation $R x = Q^T b$, or the SVD formulation $x=A^\dagger b$ where $A^\dagger$ is the Moore-Penrose pseudoinverse. This is not true for any power other than $2$.
  • The function of interest in linear least squares has all the types of convexity you could ever want. The function of interest in the absolute difference version is not even strictly convex.
  • We have the Gauss-Markov theorem: if we assume that the dependent variable data that we observe is given by a linear function of the independent variables plus uncorrelated random errors with mean zero and a common fixed variance (normality is not required), then the best linear unbiased estimator of the unknown parameters is the ordinary least squares estimator. In view of the central limit theorem, this means that if our errors each arise from a large number of small, uncorrelated (or even weakly correlated) contributions, then the ordinary least squares estimator is still a very good estimator within the class of linear unbiased estimators.
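The three formulations in the first bullet can be checked against each other directly. This is a minimal sketch (assuming NumPy is available, with random toy data) showing that the normal equations, the QR formulation, and the pseudoinverse all produce the same solution when $A$ has full column rank.

```python
import numpy as np

# Sketch: three algebraically equivalent routes to the least-squares
# solution of min ||Ax - b||^2, assuming A has full column rank.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))
b = rng.standard_normal(20)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)   # normal equations A^T A x = A^T b
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)             # QR formulation R x = Q^T b
x_svd = np.linalg.pinv(A) @ b                  # Moore-Penrose pseudoinverse (SVD)
```

In exact arithmetic all three coincide; in floating point they agree to machine precision, though the QR and SVD routes are numerically preferable because forming $A^T A$ squares the condition number.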
Another answer:

Yes, you are right. What you are describing is referred to in the machine learning literature as the non-robustness of least squares to outliers. The reason for this name is that, if you have a noisy point in your data, far away from the rest, the squared penalty tends to pull the whole model toward that single noisy outlier.
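For the one-parameter case this is easy to demonstrate: the minimizer of the squared loss is the mean, while the minimizer of the absolute loss is the median. In this sketch (toy data, hypothetical numbers), a single outlier drags the mean far away but barely moves the median.

```python
# Sketch: the squared-loss minimizer (the mean) chases an outlier,
# while the absolute-loss minimizer (the median) barely moves.
clean = [1.0, 2.0, 3.0, 4.0, 5.0]
noisy = clean + [100.0]           # one severe outlier

mean_clean = sum(clean) / len(clean)
mean_noisy = sum(noisy) / len(noisy)

def median(xs):
    s = sorted(xs)
    n = len(s)
    m = n // 2
    return s[m] if n % 2 else (s[m - 1] + s[m]) / 2

med_clean = median(clean)
med_noisy = median(noisy)
```

The mean jumps from 3.0 to roughly 19.2, while the median only moves from 3.0 to 3.5.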

A more robust alternative to mean squared error is the L1 (absolute) error. Minimizing that loss can be posed as a linear program, and hence is efficient to solve. However, it is not as efficient or straightforward as solving least squares, and, for those who care, the bad news is that it does not have a closed-form solution. (It is not clear why anyone should care, though.)
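The linear-program formulation mentioned above can be sketched as follows, here fitting a single constant $c$ to data with an outlier (assuming SciPy is available; the standard trick introduces slack variables $t_i \ge |c - y_i|$ and minimizes $\sum_i t_i$).

```python
import numpy as np
from scipy.optimize import linprog

# Sketch: least-absolute-deviations fit of a constant c to data y,
# posed as an LP:  minimize sum(t)  subject to  |c - y_i| <= t_i.
y = np.array([1.0, 2.0, 3.0, 100.0])   # toy data with one outlier
n = len(y)

obj = np.concatenate([[0.0], np.ones(n)])   # variables: [c, t_1, ..., t_n]
A_ub = np.block([
    [np.ones((n, 1)), -np.eye(n)],          #   c - t_i <=  y_i
    [-np.ones((n, 1)), -np.eye(n)],         #  -c - t_i <= -y_i
])
b_ub = np.concatenate([y, -y])

res = linprog(obj, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] + [(0, None)] * n)
c_hat = res.x[0]   # lands at a median of the data, ignoring the outlier
```

The optimum lies anywhere in $[2, 3]$ (the median interval); the solver returns one vertex of that interval, and the outlier at 100 has essentially no pull on it, in contrast to the least-squares fit (the mean, 26.5).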