Formal proof of Occam's razor for nested models


I consider two models $M_0$ and $M_1$, where $M_1$ is more complicated than $M_0$ in the sense that it has more parameters (I usually assume that $M_0$ is nested within $M_1$). They are parametrized by $\theta_0$ and $\theta_1$ respectively. I assume that

  1. $\theta_0 \subset \theta_1$ (i.e. $M_1$ has the same parameters as $M_0$ plus extra parameters)
  2. $p(\theta_0|M_1) = p(\theta_0|M_0)$ (both models have the same priors for the parameters they have in common)

I would like to prove the following inequality:

$$\forall \theta_0: \quad \langle \log p(\mathcal{D} \mid M_0) \rangle_{p(\mathcal{D} \mid \theta_0, M_0)} \geq \langle \log p(\mathcal{D} \mid M_1) \rangle_{p(\mathcal{D} \mid \theta_0, M_0)}$$

i.e. that on average, if my data $\mathcal{D}$ are generated from $M_0$ parametrized with a given $\theta_0$, then the Bayes factor is going to favor $M_0$ over $M_1$.
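As a sanity check (not a proof), the inequality can be probed numerically in a toy nested pair where both marginal likelihoods are available in closed form. The setup below is my own assumed example, not taken from the question: $M_0$ is an intercept-only Gaussian model, $M_1$ adds one covariate, the noise variance is known, and the shared intercept has the same Gaussian prior under both models, so assumptions 1 and 2 hold. Under Gaussian priors the marginal likelihood of each model is a zero-mean multivariate normal in the data vector.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Assumed toy nested pair (not from the question):
#   M0:  y_i = theta0 + eps_i
#   M1:  y_i = theta0 + theta1 * x_i + eps_i,    eps_i ~ N(0, sigma2)
# Priors: theta0 ~ N(0, tau2) under both models, theta1 ~ N(0, tau2) under M1,
# so M0 is nested in M1 and the shared prior p(theta0|M0) = p(theta0|M1).
n, sigma2, tau2 = 20, 1.0, 1.0
x = rng.normal(size=n)
theta0_true = 0.5                       # the fixed "true" theta_0

# Integrating the Gaussian priors out analytically, the marginal likelihoods
# are zero-mean Gaussians in y with these covariance matrices:
ones = np.ones(n)
cov0 = sigma2 * np.eye(n) + tau2 * np.outer(ones, ones)   # p(D | M0)
cov1 = cov0 + tau2 * np.outer(x, x)                       # p(D | M1)
m0 = multivariate_normal(mean=np.zeros(n), cov=cov0)
m1 = multivariate_normal(mean=np.zeros(n), cov=cov1)

# Monte Carlo estimate of the expected log Bayes factor under p(D|theta0, M0)
n_rep = 2000
diffs = np.empty(n_rep)
for r in range(n_rep):
    y = theta0_true + rng.normal(scale=np.sqrt(sigma2), size=n)
    diffs[r] = m0.logpdf(y) - m1.logpdf(y)                # log BF_01 for this D

print(np.mean(diffs))   # positive in this example: M0 favored on average
```

In this conjugate setting the average log Bayes factor comes out clearly positive, consistent with the conjectured inequality, but a single example obviously does not settle the general claim.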

Has this already been proved? Intuitively, it is an application of Occam's razor (a simpler, true model will be favored over a more complicated one), but I lack a formal proof.

A note on notation: $p(\mathcal{D}|M_0,\theta_0)$ is not the same as $p(\mathcal{D}|M_0)$, so I cannot simply invoke the nonnegativity of the Kullback-Leibler divergence. In "$M_0,\theta_0$" I condition on both the model and its parameters; in "$M_0$" I condition on the model alone. $p(\mathcal{D}|M_0,\theta_0)$ is the probability that the data $\mathcal{D}$ are generated from model $M_0$ with parameters $\theta_0$, while $p(\mathcal{D}|M_0)$ is the marginal likelihood over all parameters (the one used to compute the Bayes factor): $\int_{\theta} p(\mathcal{D}|M_0,\theta)\,p(\theta|M_0)\,d\theta$, where $p(\theta|M_0)$ is the prior on the parameters under model $M_0$.
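One way to make the obstacle explicit: writing $q(\mathcal{D}) = p(\mathcal{D}|\theta_0, M_0)$ for the true sampling distribution and expanding both expectations, the claimed inequality is equivalent to a difference of two KL divergences,

$$\langle \log p(\mathcal{D} \mid M_0) \rangle_{q} - \langle \log p(\mathcal{D} \mid M_1) \rangle_{q} = D_{\mathrm{KL}}\!\left(q \,\middle\|\, p(\cdot \mid M_1)\right) - D_{\mathrm{KL}}\!\left(q \,\middle\|\, p(\cdot \mid M_0)\right),$$

since the $\langle \log q \rangle_q$ terms cancel. Each divergence is nonnegative, but that alone does not order them: the inequality holds if and only if the marginal $p(\cdot \mid M_0)$ is KL-closer to $q$ than $p(\cdot \mid M_1)$ is, which is exactly what a proof would need to establish from assumptions 1 and 2.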

The question has previously been asked on Cross Validated here, but without an answer so far.