If we are choosing between models $M \in \{A, B\}$, we could compute evidence for each of them
$$P(D|M) = \int P(D|\theta, M) P(\theta|M) d\theta$$
and then argue that the model that maximizes it is superior.
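The evidence integral rarely has a closed form, but it can be estimated by simple Monte Carlo over the prior. A minimal sketch, assuming a toy Gaussian likelihood with a standard normal prior on the location (all names and numbers here are hypothetical, just to make the integral concrete):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(loc=1.0, scale=1.0, size=50)  # toy data set

def log_lik(theta, data):
    # log P(D | theta, M): Gaussian likelihood with known unit variance
    return -0.5 * np.sum((data - theta) ** 2) - 0.5 * data.size * np.log(2 * np.pi)

# Prior P(theta | M): standard normal.
# Evidence P(D | M) = E_prior[ P(D | theta, M) ], estimated by averaging
# the likelihood over S prior draws.
S = 50_000
theta_draws = rng.normal(size=S)
lls = np.array([log_lik(t, D) for t in theta_draws])

# numerically stable log-mean-exp gives log P(D | M)
m = lls.max()
log_evidence = m + np.log(np.mean(np.exp(lls - m)))
print(log_evidence)
```

This naive estimator degrades badly when the prior and likelihood barely overlap, but it is enough to compare two simple candidate models on the same data.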
I have several complex models that are hard to optimize, and I track a sequence of values $P(D|\theta_t, \alpha, M), \ t \in [0, T)$ for each model $M$ with hyperparameters $\alpha$ over many iterations. Maximum evidence in principle says that I want reasonable data likelihoods for any choice of $\theta$ under my model $M$. Is there some kind of quantity stating that, even for larger values of $t$ and for any choice of $\alpha$, if $\theta$ is optimized by gradient descent then the evidence is still acceptable? Some bastard child of Lyapunov stability and evidence.
I mean, of course I can come up with one myself (say, the integral of evidence over time), but that would most likely be pretty useless, so I wonder whether there is some already established measure of the stability of statistical models like this.
To make it clear: I am looking for a value that I am going to estimate empirically; I am not building some sort of theoretical derivation regarding the properties of these models.
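For what it's worth, the "integral of evidence over time" idea can at least be written down empirically: record $\log P(D|\theta_t, \alpha, M)$ along the gradient-descent trajectory and summarize the trace. A toy sketch, assuming a Gaussian location model and a made-up learning rate (the two summaries at the end are illustrative choices, not established measures):

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(loc=2.0, scale=1.0, size=100)

def log_lik(theta):
    # unnormalized Gaussian log-likelihood in the location parameter theta
    return -0.5 * np.sum((D - theta) ** 2)

theta, lr, T = 0.0, 1e-3, 200
trace = []
for t in range(T):
    theta += lr * np.sum(D - theta)  # gradient ascent step on log_lik
    trace.append(log_lik(theta))
trace = np.array(trace)

# two crude trajectory summaries: the time-averaged log-likelihood
# (a discrete "integral of evidence over time") and the variance of
# the second half of the trace (small if the run has settled down)
avg_ll = trace.mean()
tail_var = trace[T // 2 :].var()
```

For a genuinely unstable run (diverging $\theta_t$, oscillating likelihood) the tail variance blows up while the time average collapses, so either one would flag the problem, but neither has, to my knowledge, a standard name.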
If you take $\hat{\theta}_M = \arg\max_{\theta_t}P[D|\theta_t,\alpha,M]$ you would just be getting the MLE for $\theta$ under model $M$.
If I followed what you were saying, you were suggesting that you could approximate the evidence by $P[D|\hat{\theta}_M,M]$. If the data did come from model $M$, then the MLE is asymptotically consistent for $\theta^*$, the true parameter value, and would give that model a large amount of weight.
Suppose instead that you estimated $\hat{\theta}_M$ for a model $M$, with density $f_M(x|\theta)$, that is incorrect, the data actually coming from some other distribution with density $g$: $D \sim g$.
In that situation, $\hat{\theta}_M$ converges to the parameter that minimizes the KL divergence from the true distribution. That is,
\begin{equation} \hat{\theta}_M = \arg\min_{\theta} KL(g\,\|\,f_M(\cdot|\theta)) = \arg\min_{\theta} E_g\!\left[\log\frac{g(x)}{f_M(x|\theta)}\right] = \arg\max_{\theta} E_g[\log f_M(x|\theta)] \end{equation}
Asymptotically, since $P[\,\|\hat{\theta}_M -\theta^*\|\geq \|\hat{\theta}_{MLE} -\theta^*\|\,]\rightarrow 1$ as $n\rightarrow \infty$, you would expect for large enough $n$ that if $M_1$ is the true model and you estimate $\hat{\theta}_{M_1}$ and $\hat{\theta}_{M_2}$, then $P[D|\hat{\theta}_{M_1},M_1] > P[D|\hat{\theta}_{M_2},M_2]$, because the KL divergence reaches its minimum of zero only when the two distributions are equal.
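This misspecified-MLE behaviour is easy to check numerically. A toy sketch, assuming data drawn from an exponential $g$ fit with a (wrong) Gaussian location model; for a Gaussian location family, $\arg\max_{\theta} E_g[\log f_M(x|\theta)] = E_g[X]$:

```python
import numpy as np

rng = np.random.default_rng(2)
# true distribution g: exponential with mean 1
D = rng.exponential(scale=1.0, size=200_000)

# misspecified model M: Normal(theta, 1). The KL minimizer over theta is
# E_g[X] = 1, since maximizing E_g[log f_M(x|theta)] in a Gaussian location
# family amounts to matching the mean.
theta_hat = D.mean()  # the MLE under the Gaussian model is the sample mean
print(theta_hat)  # close to 1.0 = E_g[X] for large n
```

So even though the Gaussian model is wrong, $\hat{\theta}_M$ lands exactly on the KL-minimizing parameter, as the equation above predicts.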
Ignoring finite-sample issues, I guess a good question to ask would be whether $P[\arg\max_{M}P[D|\hat{\theta}_M,M] = M^*]\overset{P_{M^*}}{\rightarrow}1$, i.e. whether you will recover the right model with an infinite amount of data.
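That convergence is also easy to probe by simulation: draw data from the true model, maximize the likelihood under each candidate, and compare the maximized log-likelihoods. A toy sketch pitting a Gaussian truth against a Laplace alternative (both with unit scale, hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
D = rng.normal(loc=0.0, scale=1.0, size=n)  # true model M1: Normal(theta, 1)

# model M1: Normal(theta, 1); the MLE for theta is the sample mean
th1 = D.mean()
ll1 = -0.5 * np.sum((D - th1) ** 2) - 0.5 * n * np.log(2 * np.pi)

# model M2: Laplace(theta, 1); the MLE for theta is the sample median
th2 = np.median(D)
ll2 = -np.sum(np.abs(D - th2)) - n * np.log(2.0)

print(ll1 > ll2)  # with enough data the true model's maximized likelihood wins
```

Per observation the gap is roughly the difference in expected log-densities under the truth, so it grows linearly in $n$ and the true model pulls ahead with probability tending to 1.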