Situation/Question
I have a set of stochastic variables $S_X=\{X_i \mid i \in 1..n\}$, where each $X_i$ takes the value $1$ or $0$. Both $P(X_i=1)$ and $P(X_i=0)$ are unknown, but we have a model $M$ whose prediction for $P(X_i=1)$ is $\hat{P}_M(X_i=1)$.
I'm trying to find a metric that evaluates how accurate my model's predictions are. The tricky part is that I only have a single sample for each stochastic variable $X_i$.
My approach
My first approach was to try to get more samples by dividing $S_X$ into classes based on the predictions made by $M$. The assumption is that $M$ is accurate on a class defined by a subset $I$ (of $1..n$) if:
- all probabilities in $\{\hat{P}_M(X_i=1) \mid i \in I\}$ are roughly equal to a constant $c$, and
- this constant $c$ is roughly equal to the expected value of the $X_i$, approximated by the empirical mean of the samples $\{X_i \mid i \in I\}$
(in math notation: $\frac{\sum_{i \in I} X_i}{|I|} \approx c$).
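A minimal sketch of this binning idea in Python (the names and the synthetic data are my own illustration, not part of the original setup; here I pretend the model's predictions equal the true probabilities, so every class should pass the check):

```python
import numpy as np

# Synthetic stand-in data (illustrative only): one 0/1 sample per variable,
# and a "perfect" model whose prediction equals the true probability.
rng = np.random.default_rng(0)
n = 1000
p_true = rng.uniform(size=n)
x = (rng.uniform(size=n) < p_true).astype(int)   # one sample of each X_i
p_hat = p_true                                   # model predictions for P(X_i = 1)

# Classes: indices whose predictions are roughly equal (10 equal-width bins).
bins = np.clip((p_hat * 10).astype(int), 0, 9)
for b in range(10):
    idx = bins == b
    if not idx.any():
        continue
    c = p_hat[idx].mean()      # the (roughly constant) predicted probability
    freq = x[idx].mean()       # empirical mean of the samples in the class
    print(f"class {b}: c = {c:.2f}, empirical mean = {freq:.2f}, size = {idx.sum()}")
```

For a well-calibrated model, $c$ and the empirical mean should agree in every class up to sampling noise of order $\sqrt{c(1-c)/|I|}$.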
But this method can fail fairly easily. Take, for example, $P(X_k=1)=0.8$ and $P(X_l=1)=0.2$, with $M$ predicting $\hat{P}_M(X_k=1)=\hat{P}_M(X_l=1)=0.5$. Both variables land in the same class with $c=0.5$, and the expected empirical mean of that class is also $0.5$, so using the method above I would conclude $M$ is very accurate while it clearly is not.
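The failure mode can be reproduced numerically (again with illustrative synthetic data): many pairs like $X_k, X_l$ above all end up in a single class with $c=0.5$, whose empirical mean is also close to $0.5$, so the check passes even though every individual prediction is off by $0.3$:

```python
import numpy as np

# Illustrative counterexample: half the variables have P(X_i=1) = 0.8,
# the other half 0.2, but the model predicts 0.5 for all of them.
rng = np.random.default_rng(1)
n = 10_000
p_true = np.where(np.arange(n) % 2 == 0, 0.8, 0.2)
x = (rng.uniform(size=n) < p_true).astype(int)
p_hat = np.full(n, 0.5)

# Every prediction falls into one class with c = 0.5, and the empirical
# mean of the samples is also ~0.5, so the check wrongly reports accuracy.
print(f"c = {p_hat.mean():.2f}, empirical mean of samples = {x.mean():.2f}")
```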
I'm now starting to doubt whether what I'm asking is even possible, or whether I need some extra information. Is my initial approach the best possible? How would you solve this problem?