How to gather useful information from a residual plot

You can usually see how good your linear regression line is by looking at the residual plot. If the points look randomly scattered, you're good. But if you see a pattern, something is wrong with your model; perhaps you need a quadratic function instead of a linear one. I have the following points:

(10, 21), (20, 12), (30, 10), (40, 8), (50, 7), (60, 5.9), (70, 6.3), (80, 6.95), (90, 7.57), (100, 8.27), (110, 9.03), (120, 8.87), (130, 10.79), (140, 11.77), (150, 12.83).

You can immediately see the data is not properly described by a linear function, but we use our calculator to find the best-fit line anyway. We get

$$y = -0.129785714x + 10.61028571$$ with an abysmal $r^2 = 0.0209426384$

If we look at the residual plot we can clearly see the points following a quadratic pattern. That is also what you would expect from the data ($y$ keeps getting lower, then keeps getting higher). I wish I could show you a picture of the residual plot, but I don't know how.
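For readers without the calculator at hand, here is a minimal sketch in Python (NumPy assumed available) that redoes the linear fit and prints the residuals; the exact coefficients can differ slightly from the calculator's output depending on rounding and settings:

```python
import numpy as np

# Data from the question
x = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90,
              100, 110, 120, 130, 140, 150], dtype=float)
y = np.array([21, 12, 10, 8, 7, 5.9, 6.3, 6.95, 7.57,
              8.27, 9.03, 8.87, 10.79, 11.77, 12.83])

# Ordinary least squares fit of a degree-1 polynomial
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept
residuals = y - y_hat

# r^2 = 1 - SS_res / SS_tot
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"y = {slope:.6f}x + {intercept:.6f}, r^2 = {r2:.6f}")
# The residuals are positive at both ends and negative in the middle:
# the U shape that suggests adding a quadratic term.
print(np.round(residuals, 2))
```

Plotting `x` against `residuals` (e.g., with matplotlib) gives exactly the residual plot described above.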

So what I tried is using my calculator's (TI-84+) QuadReg option. We get a much better model:

$$ R^2 \approx 0.77$$

From the data, I expected this to be the best model. However, if I do QuartReg (degree $4$) we get

$$ R^2 \approx 0.95$$

Yet the residual plot of the data doesn't look like a quartic function to me at all. So the higher the degree gets, the better the fit gets: if you fit $x^n$, $R^2$ seems to approach 1 as $n$ tends to $\infty$. Or so it appears from this small test.
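The "$R^2$ only goes up" observation can be checked directly; a small sketch on the same data (the helper name `r_squared` is mine):

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90,
              100, 110, 120, 130, 140, 150], dtype=float)
y = np.array([21, 12, 10, 8, 7, 5.9, 6.3, 6.95, 7.57,
              8.27, 9.03, 8.87, 10.79, 11.77, 12.83])

def r_squared(x, y, deg):
    """In-sample R^2 of a least-squares polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, deg)
    y_hat = np.polyval(coeffs, x)
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

scores = {deg: r_squared(x, y, deg) for deg in range(1, 5)}
for deg, r2 in scores.items():
    print(f"degree {deg}: R^2 = {r2:.4f}")
```

Because each degree nests the previous one, the printed $R^2$ values never decrease as the degree grows.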

My problem with this is the obvious one: which model should you take? The model with the highest $R^2$ (quartic in this case, since my calculator can't go any higher), or should you just look at how the points are spread out in the residual plot (of the linear regression line) and choose the function that most resembles it? (In this case I'd say quadratic, but you can judge for yourself using the data.)

On BEST ANSWER

In short: do not take the one with the highest $R^2$. Your approach of looking at the plot is already very reasonable. The following is elaboration/explanation.

The model is estimated by ordinary least squares, I presume. In your quadratic case, you fit $$y_i=\alpha+\beta_1 x_i+\beta_2 x_i^2,$$ while in the cubic case, for example, you fit $$y_i=\alpha+\beta_1 x_i+\beta_2 x_i^2+\beta_3x_i^3.$$ The latter can always replicate the former by setting $\beta_3=0$. Hence, the in-sample fit will always weakly improve as you add more variables (here: higher powers of $x$), because least squares chooses the $\beta$'s to minimize the sum of squared residuals, and the larger model can do no worse. Your example of moving from the linear case to the quadratic case illustrates this.

Now what are you to do in terms of model choice? Obviously, you cannot choose a polynomial of infinite degree, because you do not have that many degrees of freedom in your data (unless you artificially generate it). There are several concerns that impact model choice:

1) Theory: you have certain data; what do you know about it? Do you have strong theoretical grounds to be interested only in, say, quadratic terms?

2) Test for significance: more parameters usually give a better fit, but that does not mean one has to include as many as possible. For the question of quadratic vs. cubic, you can test the null hypothesis that $\beta_3=0$. If you cannot reject it, you might as well drop the cubic term, because it does not explain the data significantly better than the more parsimonious model. If you want to drop several variables, you have to test them jointly (e.g., the null $\beta_3=0$ and $\beta_4=0$ simultaneously), not one at a time. At some point a higher-degree polynomial won't explain much more, and that's when you can stop.
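Point 2) can be sketched as the standard nested-model F-test (quadratic as the restricted model, cubic as the full model). This uses SciPy; the degrees-of-freedom bookkeeping below is the usual textbook formula, not anything specific to the TI-84:

```python
import numpy as np
from scipy.stats import f as f_dist

x = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90,
              100, 110, 120, 130, 140, 150], dtype=float)
y = np.array([21, 12, 10, 8, 7, 5.9, 6.3, 6.95, 7.57,
              8.27, 9.03, 8.87, 10.79, 11.77, 12.83])

def sse(deg):
    """Sum of squared residuals of a degree-`deg` polynomial fit."""
    coeffs = np.polyfit(x, y, deg)
    return np.sum((y - np.polyval(coeffs, x)) ** 2)

n = len(x)
q = 1                      # number of restrictions tested: beta_3 = 0
k_full = 4                 # parameters in the cubic model (incl. intercept)
sse_restricted, sse_full = sse(2), sse(3)

# F = [(SSE_r - SSE_f) / q] / [SSE_f / (n - k_full)]
F = ((sse_restricted - sse_full) / q) / (sse_full / (n - k_full))
p_value = f_dist.sf(F, q, n - k_full)
print(f"F = {F:.3f}, p-value = {p_value:.3f}")
```

A large p-value here would say that the cubic term does not significantly improve on the quadratic, so you could drop it.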

3) Ease of interpretation: linear and quadratic models can be interpreted very nicely. In the linear case, you know whether $x$ and $y$ are positively or negatively related; in the quadratic case you know whether the effect increases or diminishes with higher $x$. Interpretation gets harder and less clear for cubic terms and above. Hence, a quadratic model can simplify the data a bit, but you should not do this if you KNOW the relationship is not merely linear or quadratic (i.e., from looking at the plot, as you did).

As a last rule of thumb, people rarely go above a third-degree polynomial. One of the leading econometrics textbooks (Wooldridge, Introductory Econometrics) says one rarely needs more than quadratic or cubic terms. And finally, there is a phenomenon called overfitting: if you fit the data too closely, you capture not the underlying process but a lot of noise, and the result does not predict well. Here is an example, where the authors fit a 4th- and a 25th-degree polynomial to weather data, and the former predicts better (Fig. 3).
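Overfitting can be illustrated on the question's own data with a hold-out split. This is only a sketch (the every-other-point split, the rescaling of $x$, and the choice of degree 6 are arbitrary assumptions of mine); typically the high-degree fit has a far smaller training error but predicts the held-out points worse:

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90,
              100, 110, 120, 130, 140, 150], dtype=float) / 100.0  # rescale for conditioning
y = np.array([21, 12, 10, 8, 7, 5.9, 6.3, 6.95, 7.57,
              8.27, 9.03, 8.87, 10.79, 11.77, 12.83])

# Fit on every other point, evaluate on the points held out
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

results = {}
for deg in (2, 6):
    coeffs = np.polyfit(x_train, y_train, deg)
    train_mse = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
    results[deg] = (train_mse, test_mse)
    print(f"degree {deg}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

The degree-6 fit nearly interpolates its 8 training points, so its training error is tiny; what matters is how it does on the points it never saw.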