Counter-intuitive prior/posterior relationship in Bayesian inference for estimated-probability fusion


I am trying to infer the probability distribution of a binary variable $X$ (True or False) from observations $O = \langle O_1,O_2,\ldots,O_n\rangle$, which are mutually independent given $X$.

I also have an ML algorithm that takes an observation $o_i$ and learns to predict a score $s_i$ which, I take it, is an estimate of the probability of $X$ being True: $s_i \approx P(X=T\mid O_i=o_i)$ for all $i$.

Now I want to compute $P(X=T\mid O=o)$, since fusing the estimated probabilities from multiple observations should improve the final score. Using Bayes' formula I get:

$$P(X=T\mid O=o) = \frac{P(O=o_1,o_2,\ldots,o_n\mid X=T)P(X=T)}{P(O=o_1,o_2,\ldots,o_n\mid X=T)P(X=T)+P(O=o_1,o_2,\ldots,o_n\mid X=F)P(X=F)}$$

And since the observations are conditionally independent given $X$, the joint likelihood factors into per-observation terms:

$$P(X=T\mid O=o) = \frac{P(X=T)\displaystyle\prod_{i=1}^{n}P(O_i=o_i\mid X=T)}{P(X=T)\displaystyle\prod_{i=1}^{n}P(O_i=o_i\mid X=T)+P(X=F)\displaystyle\prod_{i=1}^{n}P(O_i=o_i\mid X=F)}$$

But here comes a problem: I can't put a value on $P(O_i=o_i\mid X)$. I only have $s_i \approx P(X=T\mid O_i=o_i)$ and the priors $P(X)$, so I apply Bayes' formula once more to each $P(O_i=o_i\mid X)$. The derivation seems to go well (the $P(O_i=o_i)$ terms cancel), and in the end I get:

$$P(X=T\mid O=o) \approx \frac{P(X=T)^{1-n}{\displaystyle\prod_{i=1}^{n}s_i}}{P(X=T)^{1-n}\displaystyle\prod_{i=1}^{n}{s_i} + P(X=F)^{1-n}{\displaystyle\prod_{i=1}^{n}(1-s_i)}}$$

It looks like a nice formula and has nice properties: for instance, a predicted score of $0.5$ is neutral with respect to the final posterior (given constant $n$, or given uninformative priors).

Unfortunately, it starts to look very wrong once you play with the prior. Looking closely, the priors $P(X)$ drive the final probability toward the opposite outcome. For instance, with a prior $P(X=T)=0.9$, the final probability is pushed toward $0$, which seems very weird: it seems to me that priors shouldn't work like this at all.

However, the formula seems to work quite well when I set $P(X=T)=P(X=F)=0.5$, and it avoids the usual downsides of simply averaging the score probabilities.
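To make the behaviour concrete, here is a minimal numerical sketch of the derived formula. The function name and the score values are made up for illustration; only the formula itself comes from the derivation above.

```python
import math

def fuse_scores(scores, p_true):
    """Fuse scores s_i ~ P(X=T | o_i) with the derived formula,
    in which the prior enters with exponent 1 - n."""
    n = len(scores)
    num = p_true ** (1 - n) * math.prod(scores)
    den = num + (1 - p_true) ** (1 - n) * math.prod(1 - s for s in scores)
    return num / den

scores = [0.75, 0.75, 0.75]          # three observations, all pointing to X=T
print(fuse_scores(scores, 0.5))      # ~0.964: agreeing scores reinforce each other
print(fuse_scores(scores, 0.9))     # 0.25: a *stronger* prior on T lowers the result
```

This reproduces the puzzle: three scores of $0.75$ fuse to about $0.96$ under a uniform prior, but drop to $0.25$ when the prior on $X=T$ is raised to $0.9$.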


So, my questions are:

Is there something wrong with the proof, the assumptions, or the interpretation, or is this a correct formula whose prior simply behaves counter-intuitively? Is there an existing proof of a formula for probability fusion like this?

And also, are there other cases where the prior drives the posterior toward the opposite probability?


Best answer:

You're essentially using a Naive Bayes assumption (the observations are conditionally independent given the true value of $X$). The problem lies in your description of what happens when you change the prior on $X=T$.

$$P(X=T\mid O=o) \approx \frac{P(X=T)^{1-n}{\displaystyle\prod_{i=1}^{n}s_i}}{P(X=T)^{1-n}\displaystyle\prod_{i=1}^{n}{s_i} + P(X=F)^{1-n}{\displaystyle\prod_{i=1}^{n}(1-s_i)}}$$

There are two ways of looking at this that give a little more insight into the problem. Here is the first:

$$P(X=T\mid O=o) \approx \frac{\prod_{i=1}^{n}s_i}{\prod_{i=1}^{n}{s_i} + \left(\frac{P(X=T)}{1 - P(X=T)}\right)^{n-1} \prod_{i=1}^{n}(1-s_i)}$$

In this view, you can see how a strong prior on $X=T$ affects the "strength" of the evidence carried by a negative score. What this says is that the stronger your prior for $X=T$, the harder observational evidence for $X=F$ pulls the posterior away from the prior belief. Think of it as the amount of information an observation carries. If you have a very strong prior on $X=T$, then observing something that says $X=T$ doesn't do much. However, if you observe something that suggests $X=F$, which is very atypical under your prior, then you care very much about that. This will be familiar if you've studied information theory: the amount of information an observation carries is inversely related to how likely it is to occur under your model.
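As a sanity check, this rewriting is algebraically equivalent to the original formula (divide numerator and denominator by $P(X=T)^{1-n}$). A quick numerical sketch, with arbitrary made-up scores, confirms it:

```python
import math

def fused_original(scores, p):
    # Formula with the prior exponent P(X=T)^(1-n)
    n = len(scores)
    num = p ** (1 - n) * math.prod(scores)
    return num / (num + (1 - p) ** (1 - n) * math.prod(1 - s for s in scores))

def fused_odds_form(scores, p):
    # Rewritten form with the prior-odds ratio raised to n-1
    n = len(scores)
    num = math.prod(scores)
    return num / (num + (p / (1 - p)) ** (n - 1) * math.prod(1 - s for s in scores))

scores = [0.6, 0.8, 0.55]            # arbitrary example scores
for p in (0.3, 0.5, 0.9):
    assert abs(fused_original(scores, p) - fused_odds_form(scores, p)) < 1e-12
```

The second form also makes the inversion visible directly: for $n > 1$, increasing $P(X=T)$ inflates the coefficient on the $\prod(1-s_i)$ term, so with the scores held fixed the posterior for $X=T$ decreases.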

The second way of looking at it becomes much clearer if you write out explicitly what you're calling $s_i$:

$$P(X=T\mid O=o) \approx \frac{P(X=T)\prod_{i=1}^{n} \frac{P(X=T \mid O_i)}{P(X=T)}}{P(X=T)\prod_{i=1}^{n} \frac{P(X=T \mid O_i)}{P(X=T)} + (1 - P(X=T))\prod_{i=1}^{n} \frac{1- P(X=T \mid O_i)}{1 - P(X=T)}}$$

The gist of this view is that you're adjusting the prior probability of $X=T$ while keeping the conditional probabilities under the observed values, the $s_i$, the same. If you increase $P(X=T)$ while holding the $s_i$ fixed, then you're down-weighting that evidence.

Say you have an $s_i$ of $0.75$. If $P(X=T) = 0.5$, you can see how these ratios behave as you'd intuitively expect. If you increase the prior on $X=T$, then a score of $0.75$ is no longer sufficient to support that extreme certainty in the prior, and the observation actually counts as evidence against it. With a small number of observations, the extreme prior will probably still win out (note that the prior term still multiplies the product of ratios), but the Sagan quote rings true in this example: extraordinary claims do in fact require extraordinary evidence.
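The $s_i = 0.75$ example can be checked directly by computing the per-observation ratios from the second view; the numbers below are just the ones from the example above:

```python
s = 0.75                              # a single predicted score
for p in (0.5, 0.9):
    ratio_true = s / p                # factor multiplying the X=T branch
    ratio_false = (1 - s) / (1 - p)   # factor multiplying the X=F branch
    verdict = "supports T" if ratio_true > ratio_false else "counts against T"
    print(f"prior={p}: s/p={ratio_true:.3f}, (1-s)/(1-p)={ratio_false:.3f} -> {verdict}")
```

Under $P(X=T)=0.5$ the ratios are $1.5$ vs $0.5$, so the observation supports $X=T$; under $P(X=T)=0.9$ they flip to roughly $0.833$ vs $2.5$, so the very same score now argues against $X=T$.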