What is the justification for the "K-L model prior" that makes AIC appear as a Bayesian result


When trying to understand AIC, BIC, and model selection in general, I came across a paper that states "AIC can be justified as Bayesian using a 'savvy' prior on models that is a function of sample size and the number of model parameters." [1]

I'm not an expert in the field, so I don't know how to tell good research from bad. However, I do see this paper has thousands of citations. It's referenced from the Wikipedia article on AIC. By all of the metrics that this outsider can judge, it seems legit.

The authors of this paper note that BIC assumes each model in the candidate set (with size $R$) has equal prior probability $\frac{1}{R}$. They argue this assumption is unfounded.

Then they introduce a different function for prior model probability that generates a posterior model probability that is consistent with AIC. Here is that prior:

$$ q_i = C \cdot \exp(\frac{1}{2} K_i \log(n) - K_i) $$

where

$$ C = \frac{1}{ \sum_{r=1}^R \exp(\frac{1}{2} K_r \log(n) - K_r) } $$
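To make the connection concrete, here is a small sketch of how this prior recovers AIC-style model weights from a BIC-based posterior. The candidate set sizes and log-likelihood values are hypothetical, chosen only for illustration; the algebraic point is that $\exp(-\mathrm{BIC}_i/2)\,q_i \propto \exp(-\mathrm{AIC}_i/2)$.

```python
import math

def kl_model_prior(Ks, n):
    # K-L model prior: q_i ∝ exp(0.5 * K_i * log(n) - K_i), normalized
    # over the candidate set (the constant C in the question).
    w = [math.exp(0.5 * K * math.log(n) - K) for K in Ks]
    s = sum(w)
    return [x / s for x in w]

def criterion_weights(values):
    # Akaike/BIC-style weights: w_i ∝ exp(-0.5 * Δ_i)
    m = min(values)
    w = [math.exp(-0.5 * (v - m)) for v in values]
    s = sum(w)
    return [x / s for x in w]

# Hypothetical candidate set: parameter counts and maximized log-likelihoods.
Ks = [2, 5, 10]
logLs = [-120.0, -115.0, -110.0]
n = 100

aic = [-2 * ll + 2 * K for ll, K in zip(logLs, Ks)]
bic = [-2 * ll + K * math.log(n) for ll, K in zip(logLs, Ks)]

# Posterior model probabilities under BIC with the K-L model prior:
prior = kl_model_prior(Ks, n)
post = [math.exp(-0.5 * (b - min(bic))) * q for b, q in zip(bic, prior)]
s = sum(post)
post = [p / s for p in post]

# These match the AIC weights exactly (up to floating-point error).
aic_w = criterion_weights(aic)
```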


This "K-L Model Prior" has a strong preference for models with more parameters.

Let's look at an example. (This example was not part of the original paper, so you might want to read this section with increased skepticism.)

Let's say we have two models, one with $K=2$ parameters and the other with $K=10$.

The ratio of the prior probabilities of the simpler and the more complex model is: $$ \frac{C \cdot \exp(\frac{1}{2} 2 \log(n) - 2)}{C \cdot \exp(\frac{1}{2} 10 \log(n) - 10)} $$ which reduces to $$ \frac{e^8}{n^4} $$

The $K=2$ model and the $K=10$ model are equally likely a priori when $n^4 = e^8$, i.e. at $n = e^2 \approx 7.4$ data points. As the number of data points $n$ increases, the prior probability of the $K=10$ model grows very quickly. By the time we reach $n=100$, we have a 99.997% prior belief that the $K=10$ model is the better of the two.
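These numbers can be checked directly (a quick sketch of the arithmetic above, nothing beyond the two-model example):

```python
import math

def prior_odds(n):
    # Prior odds of the K=2 model over the K=10 model: e^8 / n^4
    return math.exp(8) / n ** 4

# Odds are 1 when n^4 = e^8, i.e. n = e^2 ≈ 7.39
crossover = math.exp(2)

# At n = 100, the prior probability of the K=10 model:
p10 = 1 / (1 + prior_odds(100))   # ≈ 0.99997
```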


The authors imply that this "savvy prior" is better than assuming all models have equal prior probability.

Why is that? Can this prior be justified aside from the fact that it generates AIC as a Bayesian result?


[1] Burnham, Kenneth P., and David R. Anderson. “Multimodel Inference: Understanding AIC and BIC in Model Selection.” Sociological Methods & Research 33, no. 2 (November 2004): 261–304. doi:10.1177/0049124104268644.

https://faculty.washington.edu/skalski/classes/QERM597/papers_xtra/Burnham%20and%20Anderson.pdf