Here's a problem that I have pondered over many times without ever coming to a satisfactory solution:
Let's say that we have a series of random events: V(i) for i = 1 to n. Each of these events will have a result VR(i) of either 0 or 1, and a probability VP(i) (from zero to one) that represents the probability that VR(i) = 1.
We also have a group of k estimators: E(j) for j = 1 to k, that generate EP(j,i) which are estimates of VP(i).
What I want is an evaluation function F(j) that, given EP(j,i) for i = 1 to n (that is, E(j)'s estimates of VP(i)) and the actual results VR(i) for i = 1 to n, returns a number that is a "good" valuation of E(j)'s ability to estimate VP(i).
Some Notes:
1. The results, VR(i), are known to the evaluation function, but (obviously) not to the estimators, neither before nor after each result (so the estimators cannot use the VR(i)'s to adjust their subsequent predictions, if that matters).
2. The probabilities, VP(i), are not known to the evaluation function.
3. The distribution of the probabilities VP(i) is not known to the evaluation function.
4. EP(j,i) is supposed to be an estimate of the event probability, VP(i), and not an estimate of the result, VR(i). Virtually every valuation system that I have seen tends to weight each EP(j,i) solely on its closeness to VR(i), which invariably rewards "polarized" estimators: those that always return 1 if VP(i) > 0.5 and 0 if VP(i) < 0.5.
5. One hallmark of the problem in #4 is that if VP(i) = 0.5, then these types of valuation systems will tend either to reward all EP(j,i)'s the same, or to give the highest reward to EP's of 0 or 1 and the worst valuation to the actually correct estimate, 0.5. What I would like, of course, is just the opposite: for estimates of 0 or 1 for an actual probability of 0.5 to receive the worst valuation, and the correct estimate of 0.5 to be rated the highest.
This I think is also the essence of the difficulty in this problem: how do I correctly value the probability estimates (and estimators) if I only know the event results, but not the actual probabilities being estimated?
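To make the polarization problem from the notes concrete, here is a minimal simulation (the function and variable names are mine, not part of any standard): it scores estimators by the mean absolute distance of each estimate from the result, and shows that a polarized estimator beats an honest one even though the honest one reports the true probability.

```python
import random

random.seed(0)

def mean_abs_error(estimates, results):
    """Scores each estimate solely by its closeness to the result VR(i)."""
    return sum(abs(ep - vr) for ep, vr in zip(estimates, results)) / len(results)

n = 100_000
vp = 0.6  # the true (hidden) event probability
vr = [1 if random.random() < vp else 0 for _ in range(n)]

honest    = [vp] * n                        # reports the true probability
polarized = [1.0 if vp > 0.5 else 0.0] * n  # "rounds" every estimate to 0 or 1

print(mean_abs_error(honest, vr))     # ~ 2*vp*(1-vp) = 0.48
print(mean_abs_error(polarized, vr))  # ~ 1-vp = 0.40: ranked better, despite being wrong
```

With VP = 0.5 the two expected scores coincide, so this kind of metric cannot distinguish the correct estimate from the polarized one at all.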
You want an evaluation function that "rewards" correct probability estimates. This means its expected value should be maximized when the estimator reports the true probability. You can't do this pointwise without knowing the true probability, but you can do it "in expectation".
We want the expectation of a rating $F$ to be maximized when $EP(i) = VP(i)$. Dropping indices, this is maximizing:
$$ \langle F(EP, VR) \rangle_{VP} $$
We should be symmetric under interchange of the event happening and not happening, which means that $F(1-x, 0) = F(x, 1)$, so we can let $G(x) = F(x, 1)$, and $F(EP, VR) = G(1-EP)(1-VR) + G(EP) \, VR$.
This turns the quantity to maximize at EP=VP into:
$$ \langle G(1-EP)*(1-VR) + G(EP)*VR \rangle_{VP} = G(1-EP)*(1 - VP) + G(EP)*VP $$
We need a nice function $G$ such that the derivative of that expression with respect to $EP$ is zero at $EP = VP$, and the second derivative is negative. It turns out that taking $G(x) = \log(x)$ works nicely.
The derivative is $$-\frac{1 - VP}{1-EP} + \frac{VP}{EP},$$ and multiplying through by $EP(1-EP)$ gives $VP(1-EP) - EP(1 - VP) = VP - EP$, which is indeed 0 at $EP = VP$.
The second derivative is also negative throughout the entire range:
$$-\frac{1 - VP}{(1-EP)^2} - \frac{VP}{EP^2} < 0.$$
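As a quick numerical sanity check of the derivative argument (a sketch; the function name is mine), a grid search over candidate estimates confirms that the expected log score peaks exactly at the true probability:

```python
import math

def expected_log_score(ep, vp):
    """Expected value of log(EP) when the event happens, log(1-EP) when it doesn't."""
    return vp * math.log(ep) + (1 - vp) * math.log(1 - ep)

vp = 0.3
grid = [i / 1000 for i in range(1, 1000)]  # candidate EP values in (0, 1)
best_ep = max(grid, key=lambda ep: expected_log_score(ep, vp))
print(best_ep)  # 0.3 -> the maximum sits exactly at EP = VP
```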
So, try evaluating with $\log(EP)$ or $\log(1 - EP)$ depending on whether the event happened or not.
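A full evaluator $F(j)$ along these lines could look like the following sketch (the names and the `eps` clipping, which guards against $\log 0$ from overconfident estimates, are my additions). The Monte Carlo check shows that an estimator reporting the true VP(i) outranks both a polarized one and a constant-0.5 one, even though the evaluator only ever sees the results:

```python
import math
import random

def log_score(estimates, results, eps=1e-12):
    """F(j): average of log(EP(i)) when VR(i)=1, log(1-EP(i)) when VR(i)=0.
    Higher is better; eps clips estimates away from 0 and 1 to avoid log(0)."""
    total = 0.0
    for ep, vr in zip(estimates, results):
        p = min(max(ep, eps), 1 - eps)
        total += math.log(p) if vr else math.log(1 - p)
    return total / len(results)

# Monte Carlo check: the estimator reporting the true VP(i) wins on average.
random.seed(1)
n = 200_000
vps = [random.random() for _ in range(n)]    # hidden true probabilities
vrs = [1 if random.random() < p else 0 for p in vps]

truthful  = vps                                      # EP(i) = VP(i)
polarized = [1.0 if p > 0.5 else 0.0 for p in vps]   # rounds every estimate
constant  = [0.5] * n                                # ignores the events entirely

for name, est in [("truthful", truthful), ("polarized", polarized), ("constant", constant)]:
    print(name, log_score(est, vrs))
```

Note how the polarized estimator, which the naive closeness-to-result metrics reward, is punished hardest here: every time its rounded 0/1 estimate is on the wrong side, it pays a large log penalty.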
(This can be thought of in terms of the relative entropy between the two distributions. The relative entropy is $\sum_i p_i \log \frac{p_i}{q_i} = -H(p) - \sum_i p_i \log q_i$, so maximizing the expected log score is the same as minimizing the relative entropy from the estimated distribution to the true one.)
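Spelling that identity out numerically (a sketch; the helper names are mine): with $p = (VP, 1-VP)$ the true Bernoulli distribution and $q = (EP, 1-EP)$ the estimated one, the expected log score equals $-H(p)$ minus the relative entropy, which is nonnegative and zero exactly when $EP = VP$:

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p)

def relative_entropy(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

vp, ep = 0.7, 0.55
p = (vp, 1 - vp)  # true Bernoulli distribution
q = (ep, 1 - ep)  # estimated distribution

expected_score = vp * math.log(ep) + (1 - vp) * math.log(1 - ep)
# Identity: expected log score = -H(p) - D(p||q); D >= 0 with equality iff q = p.
print(abs(expected_score - (-entropy(p) - relative_entropy(p, q))) < 1e-9)
```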