I recently came up with this problem, spent some time trying to solve it, and I'd appreciate your help finding a solution :)
A scientist conducts an experiment in quantum physics. The experiment consists of $2$ phases, and during the first phase he collects some data. He has several theories for predicting the probability of some event happening during phase $2$ of the experiment. So, the scientist does the following:
- performs phase $1$ of the experiment. The scientist has no control over the data he receives and uses for prediction -- it is completely random.
- calculates the probability of the event using the different theories, getting different results (e.g. theory $1$ says the probability of the event is $20\%$, theory $2$ says $35\%$, etc.).
- performs phase $2$ of the experiment, and the event either does or does not happen.
Can the scientist, by repeating the experiment a large number of times, determine which theory is closer to reality?
I've tried some obvious approaches, but they are prone to failure modes like "a theory that always says $0.5$ always wins".
Yes, you can -- this is done in machine learning when you want to assess how well calibrated the probabilities from a classifier or regression model are.
A common metric is the Brier score:
$$BS = \frac{1}{n}\sum_{i=1}^n (p_i - O_i)^2$$
where $p_i$ is the predicted probability for experiment $i$ and $O_i$ is its outcome ($0$ for "didn't happen", $1$ for "did happen").
The lower the score, the better: the model's probabilities line up more closely with the observed frequencies of $1$'s and $0$'s. In particular, the Brier score is a proper scoring rule, so a theory that always predicts $0.5$ does not automatically beat a well-calibrated theory -- which addresses exactly the failure mode you mention.
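A minimal sketch of how this plays out, using synthetic data (the uniform "true" probabilities and the noise level are assumptions standing in for whatever phase $1$ produces, not part of your setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical setup: each repetition of the experiment has its own true
# event probability (drawn uniformly here), and phase 2 produces a 0/1 outcome.
p_true = rng.uniform(0.0, 1.0, n)
outcomes = (rng.uniform(0.0, 1.0, n) < p_true).astype(float)

# Theory A: well calibrated but measured with some noise.
# Theory B: the degenerate "always say 0.5" theory.
p_theory_a = np.clip(p_true + rng.normal(0.0, 0.05, n), 0.0, 1.0)
p_theory_b = np.full(n, 0.5)

def brier(p, o):
    """Mean squared difference between predicted probability and outcome."""
    return np.mean((p - o) ** 2)

# The constant-0.5 theory scores exactly 0.25 (both outcomes are 0.5 away);
# the calibrated theory scores lower, so it wins under the Brier score.
print(brier(p_theory_a, outcomes))
print(brier(p_theory_b, outcomes))
```

The constant predictor cannot do better than $0.25$, while any theory whose probabilities track the true ones scores strictly lower, so averaging the Brier score over many repetitions is exactly the comparison the question asks for.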
Of course, with your two-phase approach you will also need to be careful that, in phase $1$, the parameter estimation for each model is equally good. For example, $F_g = \frac{GmM}{r^2}$ is the correct classical law of gravity, whereas $F_g = \frac{(mM)^{\gamma}}{r}$ is completely wrong. However, if we are really bad at measuring radii, or it is easier to estimate $\gamma$ precisely, then we may end up getting better results with the wrong model because we can fit it better, not because it is actually more correct.