What happens when we minimize the sum of errors instead of the sum of errors squared?


Basically, the title says it all. Say we run a regression where, instead of OLS, we try to minimize the sum of the errors alone. Is this even possible? When I differentiate with respect to $\beta_0$ for the FOC of $\min S = \sum_{i=1}^N (y_i-\beta_1x_i-\beta_0)$, I simply get $\sum_{i=1}^N -1 = 0$, which is clearly false. Is it possible to minimize this sum in general, without knowing the actual values?




Yes, it is possible, except one usually minimizes the sum of absolute errors, $\sum_i \left|y_i - \beta_1x_i -\beta_0\right|$, to avoid the situation where positive and negative deviations cancel each other out. This $L_1$-regression problem can be reformulated as a linear program and solved efficiently. The result will be different from the estimator obtained from OLS, and will typically be less vulnerable to outliers.
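As a sketch of the linear-programming reformulation mentioned above, here is one way to set it up with `scipy.optimize.linprog`. The data, including the outlier, are made up for illustration; the standard trick is to introduce one auxiliary variable $t_i \ge |y_i - \beta_1 x_i - \beta_0|$ per observation and minimize $\sum_i t_i$:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data with one outlier (made up for this sketch)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 1.1, 1.9, 3.0, 15.0])
n = len(x)

# Variables are [beta0, beta1, t_1, ..., t_n]; the two one-sided
# constraints per point force t_i >= |y_i - beta1*x_i - beta0|, so
# minimizing sum(t_i) minimizes the sum of absolute residuals.
c = np.concatenate([[0.0, 0.0], np.ones(n)])
A_ub = np.zeros((2 * n, 2 + n))
b_ub = np.zeros(2 * n)
for i in range(n):
    # y_i - beta0 - beta1*x_i <= t_i
    A_ub[2 * i, 0] = -1.0
    A_ub[2 * i, 1] = -x[i]
    A_ub[2 * i, 2 + i] = -1.0
    b_ub[2 * i] = -y[i]
    # beta0 + beta1*x_i - y_i <= t_i
    A_ub[2 * i + 1, 0] = 1.0
    A_ub[2 * i + 1, 1] = x[i]
    A_ub[2 * i + 1, 2 + i] = -1.0
    b_ub[2 * i + 1] = y[i]

bounds = [(None, None), (None, None)] + [(0, None)] * n
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
beta0_l1, beta1_l1 = res.x[0], res.x[1]

# OLS for comparison: the single outlier drags the OLS slope far upward
beta1_ols, beta0_ols = np.polyfit(x, y, 1)
print(beta1_l1, beta1_ols)  # the L1 slope stays near 1; OLS is pulled above 3
```

Note that the $L_1$ optimum need not be unique: any point on an optimal face of the LP achieves the same sum of absolute residuals, and the solver returns one vertex of it.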

An intuitive way to understand the difference is as follows: if you remove the variable $x$ and simply fit $\beta_0$ to the data $y_i$, OLS will yield the mean of $y_i$ while $L_1$-regression will yield the median.
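This mean-versus-median behavior is easy to verify numerically; the sample below is invented for illustration:

```python
import numpy as np

# Made-up sample with one large outlier
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Intercept-only OLS: minimizing the sum of squared errors gives the mean
beta0_ols = y.mean()     # 22.0, dragged far toward the outlier

# Intercept-only L1 fit: minimizing the sum of absolute errors gives the median
beta0_l1 = np.median(y)  # 3.0, unaffected by the outlier

print(beta0_ols, beta0_l1)  # 22.0 3.0
```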

1
On

Suppose we have the data $$((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)).$$ We want to fit a line $\hat y = \hat \beta_1 x + \hat \beta_0$ to this data for some constants $\hat \beta_0, \hat \beta_1$. Then for each $x_i$, this gives us an estimate $\hat y_i = \hat \beta_1 x_i + \hat \beta_0$, and the residual is $$\epsilon_i = y_i - \hat y_i,$$ that is, it is the difference between the observed response $y_i$ and the fitted model response $\hat y_i$. The sum of the errors is $$\sum_{i=1}^n \epsilon_i = \sum_{i=1}^n (y_i - \hat y_i) = \sum_{i=1}^n (y_i - \hat \beta_1 x_i - \hat \beta_0).$$ With some minor exceptions in notation, this is essentially what you wrote.

But here is where it gets strange. If you expand this sum, you get $$-n \hat \beta_0 + \sum_{i=1}^n y_i - \hat \beta_1 \sum_{i=1}^n x_i.$$ The terms $\sum y_i$ and $\sum x_i$ are fixed constants that do not depend on the coefficients. So if I want this sum of errors to equal zero, there are two free variables for this single equation, and in general there are infinitely many solutions as a result.

To see it more clearly, let us write $n \bar x = \sum_{i=1}^n x_i$ as the sample total for the $x_i$, and $n \bar y = \sum_{i=1}^n y_i$ as the sample total for the $y_i$. So now we have the condition $$0 = -n \hat \beta_0 + n \bar y - n \bar x \hat \beta_1,$$ and after dividing everything by $n$ and moving a few things around, we get $$\hat \beta_0 = \bar y - \bar x \hat \beta_1. \tag{1}$$ So I could choose $\hat \beta_0 = 0$, $\hat \beta_1 = \frac{\bar y}{\bar x}$ if I wanted to (assuming $\bar x \neq 0$). Or I could choose $\hat \beta_0 = \bar y$ and $\hat \beta_1 = 0$. Not only can I always make the sum of errors zero; in general there are infinitely many choices of coefficients that do so. This is obviously not meaningful.
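Equation $(1)$ can be checked numerically; the data below are randomly generated purely for illustration:

```python
import numpy as np

# Made-up data; the actual values are irrelevant to the point
rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 2.0 * x + 1.0 + rng.normal(size=20)
xbar, ybar = x.mean(), y.mean()

# Equation (1): for ANY slope b1, setting b0 = ybar - b1*xbar makes the
# signed sum of residuals exactly zero, even for absurd slopes.
for b1 in (-5.0, 0.0, 3.7):
    b0 = ybar - b1 * xbar
    print(b1, np.sum(y - b1 * x - b0))  # ~0 up to floating-point rounding
```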

Why does this happen? Think about what is happening geometrically: $\hat \beta_0$ is the intercept of the fitted line, and $\hat \beta_1$ is the slope. Condition $(1)$ says that for any slope I choose, even one that doesn't trend with the data, I can simply move the intercept up or down until the sum of errors cancels out. This makes sense because the residuals are signed: if the observed value is greater than the fitted value, the residual is positive, and if it is less, the residual is negative. So by moving the line up while keeping the slope the same, I monotonically decrease the sum of residuals, and by moving it down, I monotonically increase it. Because the sum of errors is a continuous function of these residuals, and in turn of the choice of intercept, it follows that there exists a choice of intercept where the sum of errors changes sign, i.e., equals zero.

This is why we do not minimize $S$ (or $|S|$). Instead, the correct approach is to minimize $$\sum_{i=1}^n |y_i - \hat y_i|,$$ the sum of absolute deviations, because taking absolute values of the individual residuals prevents positive and negative errors from cancelling, so we end up with a meaningful answer. However, the problem with this sum is that it is not in general a smooth function of the coefficients, so it cannot be minimized by simply setting derivatives to zero. The familiar sum of squares $$\sum_{i=1}^n (y_i - \hat y_i)^2$$ also prevents cancellation and remedies the smoothness problem.
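Because the squared loss is smooth, setting its partial derivatives to zero yields the normal equations, which have a closed-form solution. A quick sketch on made-up data, cross-checked against `np.polyfit`:

```python
import numpy as np

# Illustrative data (invented for this sketch)
rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 1.5 * x - 0.5 + rng.normal(scale=0.3, size=50)

# Closed-form OLS estimates from the normal equations:
# beta1 = sample covariance / sample variance, beta0 = ybar - beta1*xbar
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# Cross-check against numpy's least-squares polynomial fit
b1_np, b0_np = np.polyfit(x, y, 1)
print(beta1, b1_np)  # agree up to floating-point rounding
print(beta0, b0_np)
```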