If we are choosing between models $M \in \{A, B\}$, we could compute evidence for each of them
$$P(D|M) = \int P(D|\theta, M) P(\theta|M) d\theta$$
and then argue that the model that maximizes it is superior.
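The evidence integral rarely has a closed form, but it can be estimated by simple Monte Carlo over the prior. A minimal sketch, assuming a toy Gaussian likelihood with a standard normal prior on the location (all names and numbers here are hypothetical, just to make the integral concrete):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(loc=1.0, scale=1.0, size=50)  # toy data set

def log_lik(theta, data):
    # log P(D | theta, M): Gaussian likelihood with known unit variance
    return -0.5 * np.sum((data - theta) ** 2) - 0.5 * data.size * np.log(2 * np.pi)

# Prior P(theta | M): standard normal.
# Evidence P(D | M) = E_prior[ P(D | theta, M) ], estimated by averaging
# the likelihood over S prior draws.
S = 50_000
theta_draws = rng.normal(size=S)
lls = np.array([log_lik(t, D) for t in theta_draws])

# numerically stable log-mean-exp gives log P(D | M)
m = lls.max()
log_evidence = m + np.log(np.mean(np.exp(lls - m)))
print(log_evidence)
```

This naive estimator degrades badly when the prior and likelihood barely overlap, but it is enough to compare two simple candidate models on the same data.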
I have several complex models that are hard to optimize, and I track a sequence of values $P(D|\theta_t, \alpha, M), \ t \in [0, T)$ for each model $M$ with hyperparameters $\alpha$ over many iterations. Maximum evidence in principle says that I want reasonable data likelihoods for any choice of $\theta$ under my model $M$. Is there some kind of quantity stating that, even for larger values of $t$ and for any choice of $\alpha$, if $\theta$ is optimized by gradient descent then the evidence is still acceptable? Some bastard child of Lyapunov stability and evidence.
I mean, of course I can come up with one myself (say, the integral of evidence over time), but that would most likely be pretty useless, so I wonder whether there is some already established measure of the stability of statistical models like this.
To make it clear: I am looking for a value that I am going to estimate empirically; I am not building some sort of theoretical derivation regarding the properties of these models.
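For what it's worth, the "integral of evidence over time" idea can at least be written down empirically: record $\log P(D|\theta_t, \alpha, M)$ along the gradient-descent trajectory and summarize the trace. A toy sketch, assuming a Gaussian location model and a made-up learning rate (the two summaries at the end are illustrative choices, not established measures):

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(loc=2.0, scale=1.0, size=100)

def log_lik(theta):
    # unnormalized Gaussian log-likelihood in the location parameter theta
    return -0.5 * np.sum((D - theta) ** 2)

theta, lr, T = 0.0, 1e-3, 200
trace = []
for t in range(T):
    theta += lr * np.sum(D - theta)  # gradient ascent step on log_lik
    trace.append(log_lik(theta))
trace = np.array(trace)

# two crude trajectory summaries: the time-averaged log-likelihood
# (a discrete "integral of evidence over time") and the variance of
# the second half of the trace (small if the run has settled down)
avg_ll = trace.mean()
tail_var = trace[T // 2 :].var()
```

For a genuinely unstable run (diverging $\theta_t$, oscillating likelihood) the tail variance blows up while the time average collapses, so either one would flag the problem, but neither has, to my knowledge, a standard name.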
If you take $\hat{\theta}_M = \arg\max_{\theta_t}P[D|\theta_t,\alpha,M]$ you would just be getting the MLE for $\theta$ under model $M$.
If I followed what you were saying, you were suggesting that you could approximate the evidence by $P[D|\hat{\theta}_M,M]$. If the data did come from model $M$, then the MLE is asymptotically consistent for $\theta^*$, the true parameter value, and would give that model a large amount of weight.
Suppose instead that you estimated $\hat{\theta}_M$ for a model $M$, with density $f_M(x|\theta)$, that is incorrect, the data actually coming from some other distribution with density $g$: $D \sim g$.
In that situation, $\hat{\theta}_M$ converges to the parameter that minimizes the KL divergence from the true distribution. That is,
\begin{equation} \hat{\theta}_M = \arg\min_{\theta} KL(g\,\|\,f_M(\cdot|\theta)) = \arg\min_{\theta} E_g\!\left[\log\frac{g(x)}{f_M(x|\theta)}\right] = \arg\max_{\theta} E_g[\log f_M(x|\theta)] \end{equation}
Asymptotically, since $P[\,\|\hat{\theta}_M -\theta^*\|\geq \|\hat{\theta}_{MLE} -\theta^*\|\,]\rightarrow 1$ as $n\rightarrow \infty$, you would expect for large enough $n$ that if $M_1$ is the true model and you estimate $\hat{\theta}_{M_1}$ and $\hat{\theta}_{M_2}$, then $P[D|\hat{\theta}_{M_1},M_1] > P[D|\hat{\theta}_{M_2},M_2]$, because the KL divergence reaches its minimum of zero only when the two distributions are equal.
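This misspecified-MLE behaviour is easy to check numerically. A toy sketch, assuming data drawn from an exponential $g$ fit with a (wrong) Gaussian location model; for a Gaussian location family, $\arg\max_{\theta} E_g[\log f_M(x|\theta)] = E_g[X]$:

```python
import numpy as np

rng = np.random.default_rng(2)
# true distribution g: exponential with mean 1
D = rng.exponential(scale=1.0, size=200_000)

# misspecified model M: Normal(theta, 1). The KL minimizer over theta is
# E_g[X] = 1, since maximizing E_g[log f_M(x|theta)] in a Gaussian location
# family amounts to matching the mean.
theta_hat = D.mean()  # the MLE under the Gaussian model is the sample mean
print(theta_hat)  # close to 1.0 = E_g[X] for large n
```

So even though the Gaussian model is wrong, $\hat{\theta}_M$ lands exactly on the KL-minimizing parameter, as the equation above predicts.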
Ignoring finite-sample issues, I guess a good question to ask would be whether $P[\arg\max_{M}P[D|\hat{\theta}_M,M] = M^*]\overset{P_{M^*}}{\rightarrow}1$, i.e. whether you will recover the right model with an infinite amount of data.
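That convergence is also easy to probe by simulation: draw data from the true model, maximize the likelihood under each candidate, and compare the maximized log-likelihoods. A toy sketch pitting a Gaussian truth against a Laplace alternative (both with unit scale, hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
D = rng.normal(loc=0.0, scale=1.0, size=n)  # true model M1: Normal(theta, 1)

# model M1: Normal(theta, 1); the MLE for theta is the sample mean
th1 = D.mean()
ll1 = -0.5 * np.sum((D - th1) ** 2) - 0.5 * n * np.log(2 * np.pi)

# model M2: Laplace(theta, 1); the MLE for theta is the sample median
th2 = np.median(D)
ll2 = -np.sum(np.abs(D - th2)) - n * np.log(2.0)

print(ll1 > ll2)  # with enough data the true model's maximized likelihood wins
```

Per observation the gap is roughly the difference in expected log-densities under the truth, so it grows linearly in $n$ and the true model pulls ahead with probability tending to 1.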