Consider a random variable $B$ (a box) and a particular value $r$ (the colour red) that it can take. On p. 14 of Bishop's Pattern Recognition and Machine Learning, he writes:
Instead, we may simply write $p(B)$ to denote a distribution over the random variable $B$, or $p(r)$ to denote the distribution evaluated for the particular value $r$...
He then goes on to say, for random variables $X$ and $Y$:
...the quantity $p(Y |X)$ is a conditional probability and is verbalized as "the probability of $Y$ given $X$"
Here are my questions:
- What exactly does "a distribution over the random variable $B$" mean? Does this distribution mean some mapping from the values the r.v. can take to probability values? I usually think about a distribution in terms of its PDF or CDF. I think $p(r)$ is the density (or the PMF in the discrete case).
- How can we condition on a random variable $X$? My understanding is that we can only condition on events. Does $p(Y |X)$ mean the conditional distribution of $Y$ given that $X$ takes some unspecified value?
$p(B)$ represents the probability mass function of the random variable $B$ at arbitrary values. It is a terribly lazy (but unfortunately common) shorthand used when authors are more interested in showing the dependencies between random variables than in any particular evaluations.
$p(r)$ demonstrates why this is a horribly confusing idea: "the probability mass function of value $r$" is meaningless unless it is implicitly clear which random variable is being discussed.
More properly we should write something like $p_{\small B}(r)$ to indicate the probability mass function of random variable $B$ evaluated at $r$. In this case that is $\mathsf P(B=r)$; the probability for the event of the box being red.
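To make the distinction concrete, here is a minimal Python sketch in which the PMF $p_{\small B}$ is a mapping from values to probabilities, and $p_{\small B}(r)$ is that mapping evaluated at one value. The names `p_B` and `pmf`, the value labels `"r"` and `"b"`, and the numbers 4/10 and 6/10 are illustrative assumptions, not something fixed by the notation.

```python
from fractions import Fraction

# Hypothetical PMF of the box B, stored as value -> probability.
# The labels and the numbers are assumptions chosen for illustration.
p_B = {"r": Fraction(4, 10), "b": Fraction(6, 10)}

def pmf(dist, value):
    """Evaluate a PMF at `value`; values outside the support get probability 0."""
    return dist.get(value, Fraction(0))

# "p(B)" in the shorthand refers to the whole mapping p_B.
# "p(r)" in the shorthand refers to the single number p_B(r) = P(B = r).
print(pmf(p_B, "r"))
```

The whole dictionary plays the role of "the distribution over $B$", while the looked-up number plays the role of "the distribution evaluated for the particular value $r$".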
Again, $p(Y\mid X)$ appears to be an abbreviation for $p_{\lower{0.5ex}{\small Y\mid X}}(y\mid x)$, or $\mathsf P(Y=y\mid X=x)$, where $x,y$ are arbitrary arguments.
The occurrence of $X$ realising a particular value is an event, so conditioning on $X=x$ is conditioning on an event.
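As a rough sketch of the discrete case, the conditional PMF can be computed from a joint PMF by dividing by the probability of the event $X=x$, which only makes sense when that probability is nonzero. The joint table and the function names `marginal_x` and `conditional_y_given_x` below are invented for illustration.

```python
from fractions import Fraction

# Hypothetical joint PMF p_{X,Y}(x, y), keyed by (x, y).
# The numbers are assumptions chosen so that they sum to 1.
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 4),
}

def marginal_x(joint, x):
    """P(X = x): sum the joint PMF over all y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def conditional_y_given_x(joint, x):
    """Return the mapping y -> P(Y = y | X = x) for one fixed value x."""
    px = marginal_x(joint, x)
    if px == 0:
        raise ValueError("cannot condition on an event of probability zero")
    return {y: p / px for (xi, y), p in joint.items() if xi == x}

# For each fixed x we get a different PMF over y.
print(conditional_y_given_x(joint, 0))
print(conditional_y_given_x(joint, 1))
```

Read this way, $p(Y\mid X)$ is a family of PMFs over $y$, one for each event $X=x$ of positive probability.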