When to extrapolate?

I know extrapolating is probably dependent on the question being asked. But are there some guidelines on when extrapolation is "okay"? If you can't be as certain that a set of data will maintain a pattern, how can you say extrapolation is reliable? If it's not reliable how can you say it's useful? Do businesses make decisions off extrapolation?

Sorry that was a lot of questions.

Best answer

In statistics, the word extrapolation is perhaps most often used in the context of regression. I will confine my Answer to extrapolation in simple regression. (Similar comments hold for multiple regression.)

Simple linear regression. Suppose you have $n$ pairs $(x_i, Y_i)$ modeled by the equation $$Y_i = \beta_0 + \beta_1 x_i + e_i,$$ where the 'errors' $e_i \stackrel{iid}{\sim} \mathsf{Norm}(0, \sigma)$ are independent and normal with mean 0 and standard deviation $\sigma.$ Also suppose that the data $(x_i, Y_i)$ have been used to fit a least-squares line, estimating the coefficients $\beta_0$ by $\hat \beta_0$ and $\beta_1$ by $\hat \beta_1,$ and the error variance $\sigma^2$ by $\hat \sigma^2.$

Prediction. Then, given a value $x_p,$ a corresponding value $\hat Y_p$ can be 'predicted' as $\hat Y_p = \hat \beta_0 + \hat \beta_1 x_p,$ and a 95% prediction interval (PI) can be found, using $\hat \sigma^2$ as well, as in your textbook or on slide 11 of these notes, which use slightly different notation.
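For reference, the usual textbook form of this interval is shown below. The symbols $t^*$ and $S_{xx}$ are notation introduced here for convenience and may not match the linked notes: $t^*$ is the 0.975 quantile of Student's t distribution with $n-2$ degrees of freedom, and $S_{xx} = \sum_{i=1}^n (x_i - \bar x)^2.$

$$\hat Y_p \pm t^*\, \hat \sigma \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar x)^2}{S_{xx}}}, \qquad \hat \sigma^2 = \frac{1}{n-2}\sum_{i=1}^n \left(Y_i - \hat \beta_0 - \hat \beta_1 x_i\right)^2.$$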

If the value $x_p$ lies within the span of the $n$ values $x_i$ of the data, then the predicted value $\hat Y_p$ is called an interpolation. The prediction interval is shortest when $x_p \approx \bar x$ and becomes wider as $x_p$ becomes farther from $\bar x.$ [Notice the factor $(x_p - \bar x)^2$ in the third term under the square root in the formula for the prediction interval.]

However, if the value $x_p$ lies beyond the span of the $n$ values $x_i$ of the data (to the right of all of them or to the left of all of them), then the predicted value $\hat Y_p$ is called an extrapolation. Extrapolation is always somewhat risky. The formula for the PI does give a much longer interval far from $\bar x,$ so that part of the risk is taken into account by the regression procedure.
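To see how quickly the interval widens, here is a minimal sketch in plain Python. The data are made up for illustration, the helper names `fit_line` and `pi_half_width` are hypothetical, and the half-width uses the standard textbook formula $t^* \hat \sigma \sqrt{1 + 1/n + (x_p - \bar x)^2 / S_{xx}}$ with the critical value $t^*$ supplied by hand from a t table.

```python
import math

def fit_line(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1, sigma_hat, x_bar, s_xx)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(sse / (n - 2))  # residual standard error, n - 2 df
    return b0, b1, sigma_hat, x_bar, s_xx

def pi_half_width(x, y, x_p, t_crit):
    """Half-width of the prediction interval for a new observation at x_p."""
    _, _, sigma_hat, x_bar, s_xx = fit_line(x, y)
    n = len(x)
    return t_crit * sigma_hat * math.sqrt(1 + 1 / n + (x_p - x_bar) ** 2 / s_xx)

# Made-up illustrative data (not from the question): x = 1..10, roughly linear y.
x = list(range(1, 11))
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

T_CRIT = 2.306  # 0.975 quantile of Student's t with n - 2 = 8 degrees of freedom

# 5.5 is the mean of x, 10 is the edge of the data, 15 and 30 are extrapolations.
for x_p in (5.5, 10, 15, 30):
    print(x_p, round(pi_half_width(x, y, x_p, T_CRIT), 3))
```

The printed half-widths grow as $x_p$ moves away from $\bar x = 5.5,$ first slowly inside the data and then much faster once $x_p$ is well outside it.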

Added risks of extrapolation. But there is additional risk in extrapolation because the formulas for $\hat \beta_0, \hat \beta_1, \hat \sigma^2,$ and for the PI assume a linear model within the span of the data's $x_i.$ The true model may be a gently bent non-linear function that curves away from the estimated model as $x$-values move beyond the span of the data. This risk is not accounted for in any of the formulas.

A small 'chunk' of a gently bent curve may be 'almost-linear' so that regression diagnostics (such as plots of residuals) show no difficulty. But as we move away from that 'chunk' we may need a different linear model or a model that is distinctly non-linear. As you suggested in your Question, the 'pattern' may change if you extrapolate.
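A small numerical sketch of this point, under assumptions chosen purely for illustration: the 'true' relationship is taken to be $y = \sqrt{x}$ with no noise, the data cover only $x = 100, \dots, 120,$ and the helper `least_squares` is a hypothetical name. Within that span the fitted line is almost indistinguishable from the curve, yet the extrapolation to $x_p = 400$ misses badly.

```python
import math

def least_squares(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

# Assumed 'true' model for illustration: a gently bent curve, observed without noise.
x = list(range(100, 121))
y = [math.sqrt(v) for v in x]

b0, b1 = least_squares(x, y)

# Largest residual inside the span of the data: the line looks like an excellent fit.
max_resid = max(abs(yi - (b0 + b1 * xi)) for xi, yi in zip(x, y))

# Extrapolate far beyond the data and compare with the assumed truth.
x_p = 400
extrap_error = (b0 + b1 * x_p) - math.sqrt(x_p)

print("largest in-sample residual:", max_resid)
print("extrapolation error at x_p = 400:", extrap_error)
```

Because the in-sample residuals are so small, $\hat \sigma$ and hence the prediction interval would be very narrow here, even though the extrapolated value is far from the curve. That is the sense in which this risk is invisible to the formulas.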

Extrapolation to the future. By their very nature, commercial, climatic, and economic forecasts of future events always involve extrapolation. Data for years $x$ in 2000-2017 may fit a linear model that no longer applies at $x_p = 2018.$ Wars, environmental events, elections, and so on may change the business or economic situation in unexpected ways. It is futile to criticize such extrapolation to the near future because (short of crystal balls and tarot cards) there is no alternative. But extrapolations must always be viewed with skepticism.

Famous quote (due to ancient Norse sages, Niels Bohr, Albert Einstein, or Yogi Berra, depending on your reading of the Internet): "Prediction is difficult, especially about the future."