Logistic Regression with a Time Factor for Forecasting

I have a general question about modelling the chance that domestic flights within the U.S. are cancelled using logistic regression (which I'm relatively new to).

I have used 'R' to fit a logistic regression model to a vast data set comprising millions of flights over the past five years (2014 to 2018) using the following factors:

  1. Year
  2. Month
  3. Day of the week
  4. Origin airport
  5. Destination airport

Now, the output of the GLM analysis shows that all of these factors are significant. In particular, the Year variable is likely key as it may give insight into the 'severity of weather' in that year, or 'mix of flights' in that year.

However, my issue now is that I want to use these results to predict the chance that a given flight (if I know the date of departure, and origin / destination) will be cancelled. I want to do this for flights occurring in 2019.

Of course, the 'Year' factor is now of no use to me, as it only covers cancellations in historic years. And I'm not sure how to interpret the results, given that I am saying a significant factor is now of no use.

Perhaps one option is to assume that 2019 will 'behave like 2018' and hence, when using the GLM for forecasting, I ignore the year variable, but assume that the '2018' factor is the correct factor to use? And 'leave' all other factor values as they are?

Or, perhaps a better option is to drop the Year factor from my GLM completely, and run it again without the Year variable? (But then, I'm not sure if the new estimates for the other variables will adjust in a way that makes sense for forecasting 2019.)

Sorry if that's a little broad and vague, but any insight into this problem and general thought processes on how best to model this would be greatly appreciated.

Thanks

There is 1 best solution below

It is not clear whether the year "factor" is used as a set of dummy variables, one for each year, or a single continuous variable (like a linear year trend). I am assuming it is the former.
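
To make the distinction concrete, here is a small sketch of the two encodings. It is written in plain Python rather than R purely for illustration, and the helper names `year_as_dummies` and `year_as_trend` are made up for this example; the dummy version assumes treatment coding with 2014 as the baseline.

```python
# Two ways to turn "year" into model inputs (illustrative only).
TRAIN_YEARS = [2014, 2015, 2016, 2017, 2018]  # years present in the fitted data


def year_as_dummies(year, baseline=2014):
    """One 0/1 column per non-baseline training year (treatment coding)."""
    return [1 if year == y else 0 for y in TRAIN_YEARS if y != baseline]


def year_as_trend(year, origin=2014):
    """A single numeric column: years elapsed since the origin year."""
    return [year - origin]


print(year_as_dummies(2018))  # the 2018 column is switched on
print(year_as_dummies(2019))  # every column is 0: no coefficient exists for 2019
print(year_as_trend(2019))    # one step beyond the fitted range 0..4
```

With dummies, a 2019 flight has no column of its own, so the encoding above silently treats it like the baseline year. With a trend, 2019 is simply an extrapolation one step past the data.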

In that case, how is your statistical model supposed to predict anything year-specific for 2019 if you have no observations for 2019 and hence cannot estimate the 2019 dummy? It won't work. There are several alternatives from here:

  1. Do not estimate year-specific effects. Just drop "year". This will (possibly greatly) reduce your model fit and predictive accuracy, but now you are not asking the model to predict something it cannot.

  2. Estimate a year trend, for example a linear or a quadratic one (check from your estimates whether there is, in fact, a discernible trend; otherwise this will not do much good in out-of-sample prediction). Then, for example, if cancellation rates increased year by year, the model will take that into account when predicting cancellations in 2019.

  3. As you mention, you could assume that years 2019 and 2018 do not differ. Then you would just use the estimate for the 2018 dummy to predict for 2019. However, this is completely ad hoc, not backed by data, and ultimately guesswork.
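
As a rough illustration of how the three options can give different 2019 forecasts, here is a sketch that works on yearly totals only. It is a deliberate simplification: the counts are hypothetical (not real flight data), the other factors are ignored, and the trend is fitted by ordinary least squares on the log-odds of the yearly rates rather than by a full logistic regression. In R the real thing would be a `glm(..., family = binomial)` fit with year entered as a numeric variable.

```python
import math

# HYPOTHETICAL yearly totals: year -> (flights, cancellations). Not real data.
COUNTS = {
    2014: (1_000_000, 20_000),
    2015: (1_000_000, 18_000),
    2016: (1_000_000, 16_000),
    2017: (1_000_000, 15_000),
    2018: (1_000_000, 14_000),
}


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b


years = sorted(COUNTS)
rates = {y: c / n for y, (n, c) in COUNTS.items()}

# Option 1: drop "year" entirely, so every year shares one pooled rate.
p_pooled = sum(c for _, c in COUNTS.values()) / sum(n for n, _ in COUNTS.values())

# Option 2: linear trend on the log-odds scale, extrapolated to 2019.
a, b = fit_line([y - 2014 for y in years], [logit(rates[y]) for y in years])
p_trend = inv_logit(a + b * (2019 - 2014))

# Option 3: assume 2019 behaves like 2018.
p_carry = rates[2018]

for name, p in [("pooled", p_pooled), ("trend", p_trend), ("carry 2018", p_carry)]:
    print(f"{name:>10}: {p:.4f}")
```

Because the made-up rates fall every year, the trend forecast comes out lowest and the pooled forecast highest. With real data the ordering depends entirely on whether a trend actually exists.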

In terms of prediction accuracy, I would expect option 2 to perform best.

By the way, you can test this by dividing your sample into "training data" and "test data": use only a few years in your sample (ideally the earliest ones, since the goal is forecasting) to estimate the model, then predict the outcomes (cancellations) for the remaining years. Then compare the predictions with the actual outcomes. Whichever model performs best has the best out-of-sample prediction accuracy.
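
A minimal sketch of such a comparison, under the same simplifying assumptions as before: hypothetical yearly totals, no other covariates, and a least-squares trend on the log-odds standing in for a proper logistic fit. It trains on 2014 to 2017, forecasts 2018 three ways, and scores each forecast by average log loss on the held-out year.

```python
import math

# HYPOTHETICAL yearly totals: year -> (flights, cancellations). Not real data.
COUNTS = {
    2014: (1_000_000, 20_000),
    2015: (1_000_000, 18_000),
    2016: (1_000_000, 16_000),
    2017: (1_000_000, 15_000),
    2018: (1_000_000, 14_000),
}


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(p, n, c):
    """Average negative log-likelihood of c cancellations in n flights under p."""
    return -(c * math.log(p) + (n - c) * math.log(1.0 - p)) / n


train_years = [2014, 2015, 2016, 2017]
test_year = 2018
train = {y: COUNTS[y] for y in train_years}
test_n, test_c = COUNTS[test_year]

# Forecasts for the test year, built from the training years only.
pooled = sum(c for _, c in train.values()) / sum(n for n, _ in train.values())
carry = train[2017][1] / train[2017][0]

# Least-squares line through the training log-odds, extrapolated one year.
xs = [y - 2014 for y in train_years]
ys = [logit(train[y][1] / train[y][0]) for y in train_years]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
trend = inv_logit(intercept + slope * (test_year - 2014))

scores = {
    "pooled": log_loss(pooled, test_n, test_c),
    "carry": log_loss(carry, test_n, test_c),
    "trend": log_loss(trend, test_n, test_c),
}
best = min(scores, key=scores.get)
for name, s in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:>6}: log loss {s:.6f}")
print("best on held-out year:", best)
```

The trend wins here only because the invented numbers contain a steady decline. The point is the procedure: forecasts use the training years alone, and the winner is chosen on data the model never saw.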

If you want to do this seriously, look into cross-validation.
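
For time-ordered data like this, one common variant is rolling-origin (expanding-window) validation, where each test year is predicted only from the years before it. A minimal sketch of the splitting logic follows; the function name `rolling_origin_splits` and the `min_train` parameter are made up for this example.

```python
def rolling_origin_splits(years, min_train=2):
    """Yield (train_years, test_year) pairs where training always precedes testing."""
    ordered = sorted(years)
    for i in range(min_train, len(ordered)):
        yield ordered[:i], ordered[i]


for train_years, test_year in rolling_origin_splits([2014, 2015, 2016, 2017, 2018]):
    print(train_years, "->", test_year)
```

Each candidate model would be refitted on every training window and scored on the matching test year, and the scores averaged across folds.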