Suppose I hire a man, let's call him a consultant, whom I sometimes ask to estimate the probability of some event. Then I observe whether the event occurs or not. Over time, I accumulate an array of pairs $(q_i, b_i)$: a predicted probability and a $0/1$ value indicating whether the event actually occurred. Now I want to evaluate how well the consultant estimates those probabilities. I don't know how to do that, so my problem is to find a scoring function whose expected score is higher the better the consultant's probability estimates are.
I will try to formalize the idea.
I need a function $F(\{(q_i,b_i)\}_{i=1}^n)$ with the following property:
Suppose I arbitrarily choose numbers $p_i$, and let $v_i$ be independent Bernoulli random variables with parameters $p_i$.
Define $G(\{q_i\}) = \mathbb{E}F(\{(q_i,v_i)\})$.
I want $G$ to achieve its maximum at $q_i=p_i$ for all $i$. Moreover, I want $G$ to increase whenever any argument $q_i$ is moved closer to $p_i$.
Thoughts: if I weaken the requirement so that all the $p_i$ are equal, then the best option for $F$ seems to be to average all the $b_i$ to get an estimate of the common $p$, and then sum the distances from the $q_i$ to this estimate. That's pretty straightforward, but it's not what I want.
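To make the equal-$p$ baseline concrete, here is a minimal sketch (my own illustration, not a standard method): estimate the common $p$ by averaging the outcomes, then sum the distances from each prediction to that estimate.

```python
def equal_p_score(qs, bs):
    """Smaller is better; assumes all events share one common probability p."""
    p_hat = sum(bs) / len(bs)  # estimate of the common p from the outcomes
    return sum(abs(q - p_hat) for q in qs)  # total distance of predictions from p_hat
```

For example, with predictions `[0.5, 0.5, 0.5]` and outcomes `[1, 0, 1]`, the estimate is $\hat p = 2/3$ and the score is $3\cdot|0.5-2/3| = 0.5$.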
Maybe my requirement is too strict, so I would be happy with any other method of scoring the predictions that favors guessing the correct probability at every moment.
Addendum. I tried to work out the following idea: for every subset of observations, compare the average $b_i$ with the average $q_i$, obtaining $2^n-1$ partial scores. Then combine them in some convoluted way such that deliberately misreporting any estimate cannot benefit the consultant. But firstly, this is just a random thought, and secondly, I couldn't get anywhere with it because the analysis seems too difficult.
The functions
$$ F_1=\sum_i\left(b_i(1-q_i)^2+(1-b_i)q_i^2\right) $$
and
$$ F_2=-\sum_i\left(b_i\ln q_i+(1-b_i)\ln(1-q_i)\right) $$
have the properties that you're looking for. As you noted in a comment, you can write them more succinctly as
$$ F_1=\sum_i(b_i-q_i)^2 $$
and
$$ F_2=-\sum_i\ln|1-b_i-q_i|\;. $$
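As a quick sanity check (my own sketch, not part of the original answer), the compact forms translate directly into code, and a grid search over $q$ confirms that the expected Brier term is smallest at $q=p$:

```python
import math

def brier(qs, bs):
    """F_1 = sum_i (b_i - q_i)^2; smaller is better."""
    return sum((b - q) ** 2 for q, b in zip(qs, bs))

def log_loss(qs, bs):
    """F_2 = -sum_i ln|1 - b_i - q_i|; smaller is better."""
    return -sum(math.log(abs(1 - b - q)) for q, b in zip(qs, bs))

def expected_brier_term(q, p):
    """Expectation of (b - q)^2 for b ~ Bernoulli(p)."""
    return p * (1 - q) ** 2 + (1 - p) * q ** 2

# Grid search: the expected term is minimized exactly at q = p.
p = 0.3
grid = [i / 100 for i in range(1, 100)]
best_q = min(grid, key=lambda q: expected_brier_term(q, p))
```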
The expectation of each term is minimized at $q_i=p_i$ (note that these are loss functions, so lower is better; this matches your requirement up to a sign flip). The minimal expected values are
$$ \min_{\{q_i\}}\mathbb E_{\{b_i\}}(F_1) = \sum_ip_i(1-p_i) $$
and
$$ \min_{\{q_i\}}\mathbb E_{\{b_i\}}(F_2) = -\sum_i\left(p_i\ln p_i+(1-p_i)\ln(1-p_i)\right)\;, $$
respectively. They depend on the $p_i$; the individual terms vary from $0$ to $\frac14$ and from $0$ to $\ln2$, respectively. Thus, without knowing the $p_i$, these functions can only be used to compare different consultants, not to assess a single consultant, since we don't know the optimal score.
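These per-term minima are easy to compute; here is a small sketch of my own that evaluates them (the $F_2$ minimum is the binary entropy in nats):

```python
import math

def min_brier_term(p):
    """Minimal expected F_1 term: p(1-p), ranging from 0 to 1/4."""
    return p * (1 - p)

def min_logloss_term(p):
    """Minimal expected F_2 term: binary entropy in nats, from 0 to ln 2."""
    if p in (0, 1):
        return 0.0  # a certain event contributes zero loss at q = p
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))
```

Both are maximized at $p=\frac12$, giving $\frac14$ and $\ln 2$ respectively, which is why the optimal score cannot be known without the $p_i$.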
In fact, this is true for any similar scoring method. Consider an arbitrary function $f(x)$ and define
$$ F=\sum_i(b_if(q_i)+(1-b_i)f(1-q_i))\;. $$
The first-order condition for the minimum expected value of each term to be at $q_i=p_i$ is
$$ p_if'(p_i)-(1-p_i)f'(1-p_i)=0\;. $$
To be able to interpret the score without knowing the $p_i$, we'd want the minimal expected score to be independent of $p_i$, so we'd want
$$ p_if(p_i)+(1-p_i)f(1-p_i) $$
to be constant. Setting its derivative with respect to $p_i$ to $0$ and using the first-order condition leads to $f(p_i)=f(1-p_i)$. Differentiating that identity gives $f'(p_i)=-f'(1-p_i)$, which, substituted into the first-order condition, yields $f'(p_i)=0$; so the requirement could only be fulfilled by constant $f$ (which is obviously not a solution).
P.S.: I just realized that this isn't just a shortcoming of this particular approach – there can't actually be any method that allows us to assess a single consultant without information about the $p_i$.
Consider two cases as an example. In the first case, all the $p_i$ are $\frac12$. In the second case, the $p_i$ are chosen uniformly randomly from $\{0,1\}$. If we know only the $b_i$, these cases are completely indistinguishable to us. A consultant who always predicts $q_i=\frac12$ would be dead right in the first case, and completely clueless in the second case, and we'd have no basis on which to distinguish these two very different performances.
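A quick simulation (an illustrative sketch of my own) makes the indistinguishability concrete: the observed outcome sequences in the two cases have exactly the same distribution, a sequence of fair coin flips.

```python
import random

random.seed(0)
n = 100_000
# Case 1: every p_i = 1/2; outcomes are fair coin flips.
case1 = [1 if random.random() < 0.5 else 0 for _ in range(n)]
# Case 2: each p_i is drawn uniformly from {0, 1}; the outcome then
# equals p_i with certainty -- which again looks like a fair coin flip.
case2 = [random.choice([0, 1]) for _ in range(n)]
mean1, mean2 = sum(case1) / n, sum(case2) / n
```

Nothing computable from the $b_i$ alone can tell the two cases apart, so no score based only on $(q_i, b_i)$ can distinguish the two performances.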
Of course, we don't need complete knowledge of the $p_i$ to assess a single consultant; for instance, if the $p_i$ are i.i.d. random variables, then it's enough to know their distribution, since integrating it with the minimal expected score would give us the expected score of a perfect consultant.
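For instance, taking $p_i \sim \mathrm{Uniform}(0,1)$ as an assumed example prior (my own illustration), the expected per-event Brier score of a perfect consultant is $\mathbb E[p(1-p)] = \int_0^1 p(1-p)\,dp = \frac16$, which a midpoint-rule sum confirms:

```python
# Midpoint-rule approximation of the integral of p(1-p) over [0, 1],
# i.e. the expected minimal Brier term under a Uniform(0, 1) prior on p.
n = 100_000
perfect_score = sum(
    (k + 0.5) / n * (1 - (k + 0.5) / n) for k in range(n)
) / n  # approximately 1/6
```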
Thanks for asking this question! I've sometimes wondered how such an assessment might be done, but I never thought it through to this point.