Do the vertical bar and semicolon denote the same thing in the context of conditional probability?

In the CMU Machine Learning Lecture, likelihood function is denoted by $P(D|\theta)$

In the Cornell lecture note, likelihood function is denoted by $P(D;\theta)$

Are the semicolon and the vertical bar the same here?

Although some writers are a bit sloppy and inconsistent, technically, no, they are not the same. They are similar, but not quite identical.

The vertical bar (the conditioning bar) is read "given that" or "contingent upon." Thus $p(x|y)$ means the probability density of $x$ given that some event or state of affairs $y$ has occurred. For instance, if $x$ represents a height in centimeters, the term $p(x|{\rm woman})$ means the probability density of the height value $x$ given that the person we are measuring is a woman. This is the classical notation for conditional probability density. This use is restricted to probability and statistics.
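The height example can be sketched numerically. The means and standard deviations below are made-up illustrative numbers, not real population statistics:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# p(x | woman): density of height x GIVEN the person measured is a woman.
# Parameters are assumptions for illustration: heights in cm,
# women ~ N(162, 7), men ~ N(176, 7).
def p_height_given_woman(x):
    return normal_pdf(x, mu=162.0, sigma=7.0)

def p_height_given_man(x):
    return normal_pdf(x, mu=176.0, sigma=7.0)
```

Conditioning on "woman" versus "man" selects a different density over the same variable $x$, which is exactly what the bar expresses.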

The semicolon is slightly different, and can apply in cases unrelated to probability and statistics. Suppose you have a mathematical function that depends upon two (or more) variables, say $f(x,y)$. Suppose you're primarily interested in the behavior of $f$ based on the value of $x$: you want to take derivatives with respect to $x$, limits for large and small $x$, count the zeros, and so on. When you write $f(x;y)$ you're in essence creating a new function OF ONE VARIABLE ($x$), which we might call $g(x) = f(x;y)$. This notation recognizes that there is an implied or chosen value of $y$ when you examine $g(x)$, but it is not "part of the function." For instance, it is no more meaningful to take the derivative of $f(x;y)$ with respect to $y$ than it would be to take the derivative of $g(x)$ with respect to $y$. The value of $y$ is a "setting" or a "parameter" that is fixed.
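In code, the same idea is "freezing" one argument. A minimal sketch with a made-up function $f(x,y) = x^2 + yx$, using Python's `functools.partial` to produce $g(x) = f(x; y{=}3)$:

```python
from functools import partial

# A function of two variables.
def f(x, y):
    return x**2 + y * x

# f(x; y=3): fix y, leaving a function of x alone.
g = partial(f, y=3)   # g(x) = x**2 + 3*x

# A derivative with respect to x makes sense for g;
# a derivative "with respect to y" does not -- y is a frozen setting.
h = 1e-6
dg_dx = (g(2 + h) - g(2 - h)) / (2 * h)   # central difference, approx 2*2 + 3 = 7
```

Here `g` is genuinely a one-variable function; `y` lives inside it as a parameter, exactly as the semicolon notation suggests.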

This notation helps avoid confusion when one asks about the intrinsic dimension of a function.

A particularly useful application of this semicolon notation in probability is in the Expectation-Maximization (or EM) Algorithm. In that algorithm, the M-step maximizes a function (the expected complete-data log-likelihood, commonly written $Q(\theta; \theta^{(t)})$) with respect to $\theta$, while the previous estimate $\theta^{(t)}$ is fixed and "cannot be touched."
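A toy sketch of that pattern: EM for a 1-D mixture of two Gaussians, where each M-step maximizes the expectation computed under the old, fixed parameters. The unit variances, equal mixture weights, and min/max initialization are simplifying assumptions for illustration, not a standard implementation:

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(data, n_iter=50):
    """EM for a mixture of two unit-variance Gaussians with equal weights.
    Each M-step maximizes Q(theta; theta_old) over theta = (mu1, mu2),
    with theta_old held fixed -- the semicolon's role."""
    mu1, mu2 = min(data), max(data)           # crude initialization
    for _ in range(n_iter):
        # E-step: responsibilities computed under the FIXED old parameters.
        r = []
        for x in data:
            p1 = normal_pdf(x, mu1, 1.0)
            p2 = normal_pdf(x, mu2, 1.0)
            r.append(p1 / (p1 + p2))
        # M-step: maximize Q(.; old params) -> responsibility-weighted means.
        mu1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
        mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)
    return mu1, mu2

random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(200)]
        + [random.gauss(5.0, 1.0) for _ in range(200)])
m1, m2 = sorted(em_two_gaussians(data))
# m1 should land near 0, m2 near 5
```

Inside each iteration the old parameters play the part of the $y$ in $f(x;y)$: they shape the objective but are not among the variables being optimized.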

Another use in statistics and pattern recognition is the following. Suppose you have a training sample with features $x_1, x_2, \ldots , x_d$. You form the dot product with some real-valued weight vector ${\bf w} = \{ w_1, w_2, \ldots , w_d \}$, as in ${\bf w}^t {\bf x} = w_1 x_1 + w_2 x_2 + \ldots + w_d x_d$. You want to optimize the value of ${\bf w}$ for some classification task. You create a function $h({\bf w}; {\bf x})$, meaning you can take derivatives with respect to the weights, but it makes NO SENSE to take a derivative with respect to the data ${\bf x}$, nor is ${\bf w}$ "contingent upon" ${\bf x}$. Nevertheless, the particular functional form of $h$ has buried behind it the particular values of the data at hand.
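The same "data buried behind the objective" pattern, as a minimal sketch with made-up numbers: a squared-error objective $h(w; {\bf x}, {\bf y})$ over fixed 1-D data, minimized over $w$ alone by gradient descent:

```python
# h(w; xs, ys): mean squared error of a linear predictor on FIXED data.
# We differentiate with respect to w only; the data are a frozen "setting".
def make_h(xs, ys):
    def h(w):
        return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return h

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]   # roughly y = 2x (illustrative numbers)

h = make_h(xs, ys)     # the data are now buried inside h

# Gradient descent on w alone; d h / d w, never "d h / d x".
w = 0.0
for _ in range(200):
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= 0.05 * grad
# w converges to the least-squares minimizer sum(x*y)/sum(x^2)
```

The closure `h` is a function of `w` only, while `xs` and `ys` sit inside it exactly as the semicolon notation indicates.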

Clear?