What is p(Evidence) exactly in a bayesian model?


I'm having a hard time intuitively understanding what this means in a machine learning context. When using variables like $A$ or $B$ in some trivial example, it all makes sense, but when looking at machine learning formulas with real variables it's harder to see exactly what is meant. For example, if $t$ is what I am trying to predict and $x$ is the training example or input...

$$ p(t|x) = \frac{p(x|t)p(t)}{p(x)} $$

What is meant by $p(x)$? If $x$ is a training example, does it mean the probability of seeing $x$ out of all possible training examples (kind of like the probability of drawing $x$ from a hat)? The probability of seeing $x$ under the previously known distribution of examples? Or something else?

Sometimes I see this with model parameters such as $\theta$ as well which raises the same sort of questions.


BEST ANSWER

Let's take your dice example to try to illustrate the issue. Here $T$ is your uncertain parameter and $t$ a value it can take, while $X$ is your observation and $x$ a particular value it can take.

  • Suppose you have a $t$-sided fair die, but you do not know what value $t$ has. You do have a prior distribution for $t$ of $P(T=t) = \frac{t}{2^{t+1}}$ for $t \in \{1,2,\ldots\}$.

  • You roll the die and observe a value $X=x$. Since this is a fair die, you know $P(X=x \mid T=t) = \frac{1}{t}$ for $x \in \{1,2,\ldots,t\}$.

  • You can at this stage ask what is the unconditional $P(X=x)$? In other words, at the start, what do you think the probability is of rolling a particular value $x$ even though you do not know how many sides the die has? As a simple application of conditional probability, $$P(X=x) = \sum_t P(X=x \mid T=t)\, P(T=t) = \sum\limits_{t=x}^\infty \frac{1}{2^{t+1}} = \frac{1}{2^{x}}$$

As examples, from the first bullet $P(T=6)=\frac{6}{128}$ and $P(T=7)=\frac{7}{256}$, etc. So the unconditional or marginal probability of rolling $X=6$ is $$P(X=6) = \frac{1}{6} \times \frac{6}{128} + \frac{1}{7} \times \frac{7}{256}+ \cdots = \frac{1}{64}= \frac{1}{2^6}$$
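The marginal computation above can be checked numerically. This is a minimal sketch (the function names `prior`, `likelihood`, and `marginal` are my own, and the infinite sum over $t$ is truncated at a large cutoff):

```python
from fractions import Fraction

def prior(t):
    """P(T = t) = t / 2^(t+1), the prior over the number of sides."""
    return Fraction(t, 2 ** (t + 1))

def likelihood(x, t):
    """P(X = x | T = t) = 1/t for a fair t-sided die, 0 if x > t."""
    return Fraction(1, t) if 1 <= x <= t else Fraction(0)

def marginal(x, cutoff=200):
    """P(X = x) = sum over t of P(X = x | T = t) P(T = t),
    truncated at t = cutoff (terms with t < x are zero)."""
    return sum(likelihood(x, t) * prior(t) for t in range(x, cutoff + 1))

# The truncated sum agrees with the closed form 1/2^x to high precision.
print(float(marginal(6)))   # very close to 1/64 = 0.015625
```

This is exactly the "probability of $x$ averaged over everything you believe about $t$" reading of the evidence term.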

If you do roll a $6$ then you then know the number of sides $T \ge 6$, and you get a posterior probability mass function $$P(T=t \mid X=6) = \frac{\frac{1}{2^{t+1}}}{\frac{1}{2^{6}}} = \frac{1}{2^{t-5}}$$ for $t \ge 6$, so $P(T=6 \mid X=6)= \frac12$, $P(T=7 \mid X=6)= \frac14$, etc.
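The posterior update can be sketched the same way, dividing likelihood times prior by the evidence computed above (again, the helper names are my own):

```python
from fractions import Fraction

def prior(t):
    """P(T = t) = t / 2^(t+1)."""
    return Fraction(t, 2 ** (t + 1))

def likelihood(x, t):
    """P(X = x | T = t) = 1/t for 1 <= x <= t, else 0."""
    return Fraction(1, t) if 1 <= x <= t else Fraction(0)

# Evidence P(X = 6) = 1/2^6, as derived in the text.
evidence = Fraction(1, 2 ** 6)

def posterior(t, x=6):
    """Bayes' rule: P(T = t | X = x) = P(X = x | T = t) P(T = t) / P(X = x)."""
    return likelihood(x, t) * prior(t) / evidence

print(posterior(6))   # 1/2
print(posterior(7))   # 1/4
```

Note that the evidence $P(X=6)$ does not depend on $t$ at all; its only job is to normalize the posterior so it sums to $1$ over $t \ge 6$.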

ANOTHER ANSWER

Well, if you have the joint probability $p_{X,Y}(x,y)$, then $p_X(x)=\sum_y p_{X,Y}(x,y)$ is a marginal probability.

$$ p_{X \mid Y}(x \mid y) = \frac{p_{Y \mid X}(y \mid x)\, p_X(x)}{p_Y(y)} $$ has the form

Posterior = ( Likelihood $\times$ Prior ) / Evidence.