Given a number of sources of probability estimates, how does one quantify which source is "best"? For example, suppose I have two sources of estimates, A and B. If A says the probabilities are [.3, .5, .4, .9, .8], B says the probabilities are [.7, .9, .5, .5, .6], and the event results are [false, false, true, true, false], how does one generally establish whether A or B was "better"?
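To make the example concrete: one approach I'm aware of is a scoring rule like the Brier score, i.e. the mean squared difference between each stated probability and the 0/1 outcome. A minimal sketch (the name `brier` is just mine):

```python
def brier(probs, outcomes):
    """Mean squared difference between each stated probability and the
    0/1 outcome; lower is better, 0 is a perfect, fully certain predictor."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

probs_a = [0.3, 0.5, 0.4, 0.9, 0.8]
probs_b = [0.7, 0.9, 0.5, 0.5, 0.6]
outcomes = [0, 0, 1, 1, 0]  # false, false, true, true, false

print(brier(probs_a, outcomes))  # ~0.27  -> A scores better here
print(brier(probs_b, outcomes))  # ~0.432
```

But I don't know whether this is the standard way to frame the comparison, which is really what the question is about.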
I've been asked to clarify the specifics of the question, but my question is less about a specific solution and more about my interest in all the angles from which this problem can be attacked.
If I were pressed into providing specifics, I'd point to a pair of thoughts I've had about this. First, when I hear people say, "I'm 30% sure that..." or, "There's only a 10% chance of that ever happening," I have wondered how you could gauge that person's skill at estimating probabilities. Clearly, I could collect a few hundred samples from that individual and see whether the stuff he labels at 10% happens roughly 10% of the time, his 20% calls 20% of the time, his 30% calls 30%, and so on. But if the sample set is, say, a dozen different estimates covering the range from 0.0 to 1.0, how do you assess the quality of those estimates? This is a question that's been tickling the back of my mind for years.
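That "do his 10% calls happen 10% of the time" check can at least be sketched mechanically by binning the predictions and comparing each bin's stated probability with its observed frequency (a rough sketch; `calibration_table` and the bin count are my own choices, and it says nothing yet about the small-sample case I'm asking about):

```python
from collections import defaultdict

def calibration_table(probs, outcomes, n_bins=10):
    """Group predictions into equal-width probability bins and compare the
    mean stated probability with the observed frequency in each bin."""
    bins = defaultdict(list)
    for p, o in zip(probs, outcomes):
        # Bin index 0..n_bins-1; p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))
    table = {}
    for idx, pairs in sorted(bins.items()):
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        freq = sum(o for _, o in pairs) / len(pairs)
        table[idx] = (mean_p, freq, len(pairs))  # stated, observed, count
    return table
```

A well-calibrated forecaster has `mean_p` close to `freq` in every bin, but with only a dozen estimates most bins hold one or two samples, which is exactly where this breaks down.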
Second, I might be writing a genetic algorithm that looks at current, highly local weather patterns and guesses the probability of the wind switching to 15-20 MPH from somewhere between SW and NW. I have decades of historical data, taken every few minutes, to train with, but evaluating the fitness of an individual in the gene pool hinges on that individual's accuracy in estimating probabilities.
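For that fitness evaluation, one candidate I've considered (again only a sketch, with hypothetical names) is the negative mean log loss, which rewards confident correct probabilities and punishes confident wrong ones heavily:

```python
import math

def log_loss_fitness(probs, outcomes, eps=1e-12):
    """Negated mean negative log-likelihood of the 0/1 outcomes under the
    stated probabilities; higher fitness is better, 0 is the best possible.
    Probabilities are clipped away from 0 and 1 to avoid infinite penalties."""
    total = 0.0
    for p, o in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)
        total += -(math.log(p) if o else math.log(1 - p))
    return -total / len(probs)
```

Whether this is the right fitness function for such a GA, versus squared error or something else entirely, is part of what I'm hoping answers will address.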