Here's a problem that I have pondered over many times without ever coming to a satisfactory solution:
Let's say that we have a series of random events: V(i) for i = 1 to n. Each of these events will have a result VR(i) of either 0 or 1, and a probability VP(i) (from zero to one) that represents the probability that VR(i) = 1.
We also have a group of k estimators: E(j) for j = 1 to k, that generate EP(j,i) which are estimates of VP(i).
What I want is an evaluation function F(j) that, given EP(j,i) for i = 1 to n (that is, E(j)'s estimates of VP(i)) and the actual results VR(i) for i = 1 to n, returns a number that is a "good" valuation of E(j)'s ability to estimate VP(i).
Some Notes:
1. The results, VR(i), are known to the evaluation function, but (obviously) not to the estimators, neither before nor after each result (so the estimators cannot use the VR(i)'s to adjust their subsequent predictions, if that matters).
2. The probabilities, VP(i), are not known to the evaluation function.
3. The distribution of the probabilities VP(i) is not known to the evaluation function.
4. EP(j,i) is supposed to be an estimate of the event probability, VP(i), and not an estimate of the result, VR(i). Virtually every valuation system that I have seen tends to weight each EP(j,i) solely on its closeness to VR(i), which invariably rewards "polarized" estimators: those that always return 1 if VP(i) > 0.5 and 0 if VP(i) < 0.5.
5. One hallmark of the problem in #4 is that if VP(i) = 0.5, then these types of valuation systems will tend either to reward all EP(j,i)'s the same, or to give the highest reward to EP's of 0 or 1 and the worst valuation to the actually correct estimate, 0.5. What I would like, of course, is just the opposite: for estimates of 0 or 1 for an actual probability of 0.5 to receive the worst valuation, and the correct estimate of 0.5 to be rated the highest.
This I think is also the essence of the difficulty in this problem: how do I correctly value the probability estimates (and estimators) if I only know the event results, but not the actual probabilities being estimated?
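To make the polarization problem from the notes concrete, here is a minimal simulation (the function and variable names are mine, not part of any standard): it scores estimators by the mean absolute distance of each estimate from the result, and shows that a polarized estimator beats an honest one even though the honest one reports the true probability.

```python
import random

random.seed(0)

def mean_abs_error(estimates, results):
    """Scores each estimate solely by its closeness to the result VR(i)."""
    return sum(abs(ep - vr) for ep, vr in zip(estimates, results)) / len(results)

n = 100_000
vp = 0.6  # the true (hidden) event probability
vr = [1 if random.random() < vp else 0 for _ in range(n)]

honest    = [vp] * n                        # reports the true probability
polarized = [1.0 if vp > 0.5 else 0.0] * n  # "rounds" every estimate to 0 or 1

print(mean_abs_error(honest, vr))     # ~ 2*vp*(1-vp) = 0.48
print(mean_abs_error(polarized, vr))  # ~ 1-vp = 0.40: ranked better, despite being wrong
```

With VP = 0.5 the two expected scores coincide, so this kind of metric cannot distinguish the correct estimate from the polarized one at all.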
You want an evaluation function that "rewards" correct probability estimates. This means its expected value should be maximized when the estimator reports the true probability. You can't do this pointwise without knowing the true probability, but you can do it "in expectation".
We want the expectation of a rating $F$ to be maximized when $EP(i) = VP(i)$. Dropping indices, this is maximizing:
$$ \langle F(EP, VR) \rangle_{VP} $$
We should be symmetric under interchange of the event happening and not happening, which means that $F(1-x, 0) = F(x, 1)$, so we can let $G(x) = F(x, 1)$, and $F(EP, VR) = G(1-EP)(1-VR) + G(EP) \, VR$.
This turns the quantity to maximize at EP=VP into:
$$ \langle G(1-EP)*(1-VR) + G(EP)*VR \rangle_{VP} = G(1-EP)*(1 - VP) + G(EP)*VP $$
We need a nice function $G$ such that the derivative of that expression with respect to $EP$ is zero at $EP = VP$, and the second derivative is negative. It turns out that taking $G(x) = \log(x)$ works nicely.
The derivative is $$-\frac{1 - VP}{1-EP} + \frac{VP}{EP},$$ and multiplying through by $EP(1-EP)$ gives $VP(1-EP) - EP(1 - VP) = VP - EP$, which is indeed 0 at $EP = VP$.
The second derivative is also negative throughout the entire range:
$$-\frac{1 - VP}{(1-EP)^2} - \frac{VP}{EP^2} < 0.$$
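As a quick numerical sanity check of the derivative argument (a sketch; the function name is mine), a grid search over candidate estimates confirms that the expected log score peaks exactly at the true probability:

```python
import math

def expected_log_score(ep, vp):
    """Expected value of log(EP) when the event happens, log(1-EP) when it doesn't."""
    return vp * math.log(ep) + (1 - vp) * math.log(1 - ep)

vp = 0.3
grid = [i / 1000 for i in range(1, 1000)]  # candidate EP values in (0, 1)
best_ep = max(grid, key=lambda ep: expected_log_score(ep, vp))
print(best_ep)  # 0.3 -> the maximum sits exactly at EP = VP
```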
So, try evaluating with $\log(EP)$ or $\log(1 - EP)$ depending on whether the event happened or not.
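A full evaluator $F(j)$ along these lines could look like the following sketch (the names and the `eps` clipping, which guards against $\log 0$ from overconfident estimates, are my additions). The Monte Carlo check shows that an estimator reporting the true VP(i) outranks both a polarized one and a constant-0.5 one, even though the evaluator only ever sees the results:

```python
import math
import random

def log_score(estimates, results, eps=1e-12):
    """F(j): average of log(EP(i)) when VR(i)=1, log(1-EP(i)) when VR(i)=0.
    Higher is better; eps clips estimates away from 0 and 1 to avoid log(0)."""
    total = 0.0
    for ep, vr in zip(estimates, results):
        p = min(max(ep, eps), 1 - eps)
        total += math.log(p) if vr else math.log(1 - p)
    return total / len(results)

# Monte Carlo check: the estimator reporting the true VP(i) wins on average.
random.seed(1)
n = 200_000
vps = [random.random() for _ in range(n)]    # hidden true probabilities
vrs = [1 if random.random() < p else 0 for p in vps]

truthful  = vps                                      # EP(i) = VP(i)
polarized = [1.0 if p > 0.5 else 0.0 for p in vps]   # rounds every estimate
constant  = [0.5] * n                                # ignores the events entirely

for name, est in [("truthful", truthful), ("polarized", polarized), ("constant", constant)]:
    print(name, log_score(est, vrs))
```

Note how the polarized estimator, which the naive closeness-to-result metrics reward, is punished hardest here: every time its rounded 0/1 estimate is on the wrong side, it pays a large log penalty.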
(This can be thought of in terms of the relative entropy between the two distributions. The relative entropy is $\sum_i p_i \log \frac{p_i}{q_i} = -H(p) - \sum_i p_i \log q_i$, so maximizing the expected log score is the same as minimizing the relative entropy from the estimated distribution to the true one.)
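Spelling that identity out numerically (a sketch; the helper names are mine): with $p = (VP, 1-VP)$ the true Bernoulli distribution and $q = (EP, 1-EP)$ the estimated one, the expected log score equals $-H(p)$ minus the relative entropy, which is nonnegative and zero exactly when $EP = VP$:

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p)

def relative_entropy(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

vp, ep = 0.7, 0.55
p = (vp, 1 - vp)  # true Bernoulli distribution
q = (ep, 1 - ep)  # estimated distribution

expected_score = vp * math.log(ep) + (1 - vp) * math.log(1 - ep)
# Identity: expected log score = -H(p) - D(p||q); D >= 0 with equality iff q = p.
print(abs(expected_score - (-entropy(p) - relative_entropy(p, q))) < 1e-9)
```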