In least squares regression, we try to minimize the sum of the squared error terms. I was wondering if this unfairly penalizes a model for points that are far from the line. For example, a point that is $1$ unit away from the line of best fit contributes $1^2=1$ to the MSE, but a point that is $2$ units away contributes $2^2=4$.
It seems to me that the point that is $2$ units away has four times as large an influence on the "performance" of a linear model as the first point.
I was wondering why this is the case, and whether there is an intuitive explanation for it. Is the square used because cubes, fourth powers, etc., are harder to work with? I'd imagine it's easier to optimize a square than an absolute value. Is there some fundamental reason for the choice, or is it just convention?
If anybody would have an intuitive explanation, that'd be perfect. Thanks so much!
This is one of those things that is hard to give a single answer to. At the end of the day there are indeed modeling drawbacks, but squared error has so many theoretical advantages that it is rather hard to resist.
I think the best point I can make is the one I will make first: absolute differences (probably the most obvious alternative to consider) have a serious problem. Their minimizer is frequently not unique. For instance, if we just look at "constant trendlines" (so we force the slope to be zero), a data set with an even number of data points, all distinct, has a whole continuum of horizontal lines that minimize the sum of absolute vertical differences. These minimizers are precisely the medians, and they consist of all numbers between the $(N/2)$th data point and the $(N/2+1)$th data point.
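This non-uniqueness is easy to check numerically. A minimal sketch (the data set and the helper `sum_abs_dev` are made up for illustration): with $N=4$ points, every constant between the two middle order statistics attains the same minimal sum of absolute deviations.

```python
# Four distinct data points; the two middle ones are 2.0 and 5.0,
# so every constant c in [2.0, 5.0] is a "median" in the minimizing sense.
data = [1.0, 2.0, 5.0, 9.0]

def sum_abs_dev(c, xs):
    """Sum of absolute vertical differences from the constant trendline y = c."""
    return sum(abs(x - c) for x in xs)

print(sum_abs_dev(2.0, data))  # 11.0 -- left endpoint of the median interval
print(sum_abs_dev(3.5, data))  # 11.0 -- an interior point, same loss
print(sum_abs_dev(5.0, data))  # 11.0 -- right endpoint, same loss
print(sum_abs_dev(1.0, data))  # 13.0 -- outside the interval, strictly worse
```

By contrast, the squared-error loss for this data has a single minimizer (the mean), which is the uniqueness advantage described above.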
This non-uniqueness effect is closely tied to a sensitivity effect: adding a data point frequently changes the character of the minimizer drastically. For example, if you begin with an even number of data points and your $(N/2)$th data point and $(N/2+1)$th data points are far apart, then you might say "OK, I'll just pick the middle point between them as my median and be done with it". (After all, this is what we were taught in pre-algebra or whatever, and it makes a lot of intuitive sense.) If you now add a data point which is not between these two points, the median is suddenly one of those two points, which is very far away from your previous median, even if $N$ is very large. It's not far away from the whole set of previous medians, but we don't want to have to think in terms of a whole set of medians. (On the flip side, for medians, this is the worst thing that can happen. For least squares, a sufficiently severe outlier can have an even more dramatic effect.)
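The jump described above can be seen in a tiny example. This sketch (data values chosen for illustration) uses the usual "midpoint of the two middle values" convention for an even number of points:

```python
def median(xs):
    """Median with the conventional midpoint rule for even-length data."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2  # midpoint of the middle gap

data = [0, 0, 0, 100, 100, 100]  # even N; the middle gap is [0, 100]
print(median(data))              # 50.0 -- the conventional midpoint choice
print(median(data + [-1]))       # 0   -- one added point moves it to an endpoint
```

Adding a single point below the gap moves the conventional median from $50$ all the way to $0$, even though it stays inside the old set of minimizers $[0, 100]$.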
Linear least squares has neither of those problems: under a simple "non-degeneracy" assumption, linear least squares has a unique solution, and the sensitivity of this solution to the addition of a new data point decays with the size of the data set. This is often also true of nonlinear least squares.
Other advantages: