Why are $p(y)$ called "prior frequencies for classes"?

Because they apply to $y$, not $x$. But since the prediction is made $x \rightarrow y$, shouldn't $y$ be the posterior?

Particularly, this is in the context of Naive Bayes Classifier.

Or do my notes contain an error?

Generally for a Bayesian model trying to relate $x$ to $y$, we are interested in the conditional distribution: $$ p(y|x) = \frac{p(x|y)p(y)}{p(x)} $$ for features $x=(x_1,\ldots,x_n)$. Terminology-wise, $p(y|x)$ is the posterior, $p(x|y)$ is the likelihood, and $p(x)$ is the evidence.

Why is $p(y)$ the prior? Think of it in Bayesian terms. Suppose you are waiting for the next data point $x$ to arrive, and you want to guess the value of $y$ that will be associated with it. Since you don't yet know what $x$ will be, the feature values you have already seen are of no direct use. However, the marginal distribution of $y$ in the data you have seen is still useful. So, while waiting for $x$ to arrive, you can go through your dataset of labels $Y=(y_1,\ldots,y_m)$ (ignoring the $X$ part) and use it to estimate $p(y)$. This forms your prior belief about what $y$ will be when it arrives along with $x$. In this sense, "prior" means prior to receiving the data.
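Concretely, estimating $p(y)$ this way is just the relative frequency of each class among the labels already seen. A minimal Python sketch (the label values are invented for illustration):

```python
from collections import Counter

def estimate_prior(labels):
    """Estimate p(y) as the relative frequency of each class label."""
    counts = Counter(labels)
    total = len(labels)
    return {y: c / total for y, c in counts.items()}

# Labels observed so far, before any new x arrives
Y = ["spam", "ham", "ham", "spam", "ham"]
prior = estimate_prior(Y)
# prior["ham"] == 0.6, prior["spam"] == 0.4
```

Note that no feature values are consulted at all: the prior is computed purely from the labels.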

Note that your learned classifier estimates $p(y|x)$, the conditional probability. Given an unlabeled data point $x$, one can then pick the most likely $y$ associated with it. The prediction is made using $x$, so it involves the posterior. The prior, by contrast, is a distribution over $y$ with no reference to $x$ at all.

In summary, $p(y)$ is quite literally the distribution over classes without reference to $x$. Hence it is called the prior distribution over classes because it forms your prior belief about an unseen label $y$ before seeing any new data $x$.
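To make the role of the prior in Naive Bayes concrete, here is a count-based sketch for categorical features (the toy "email" features and data are invented): each prediction starts from the prior $p(y)$ and multiplies in the per-feature likelihoods $p(x_i|y)$; the evidence $p(x)$ is the same for every class, so the argmax can ignore it.

```python
from collections import Counter

def fit_naive_bayes(X, Y):
    """Count-based estimates of the prior p(y) and likelihoods p(x_i | y)."""
    prior = {y: c / len(Y) for y, c in Counter(Y).items()}
    likelihood = {}  # likelihood[y][i] maps feature value v -> p(x_i = v | y)
    for y in prior:
        rows = [x for x, label in zip(X, Y) if label == y]
        likelihood[y] = [
            {v: c / len(rows) for v, c in Counter(r[i] for r in rows).items()}
            for i in range(len(X[0]))
        ]
    return prior, likelihood

def predict(prior, likelihood, x):
    """Return argmax_y p(y) * prod_i p(x_i | y); the evidence p(x) cancels."""
    scores = {}
    for y, p_y in prior.items():
        score = p_y  # start from the prior, before looking at x
        for i, v in enumerate(x):
            score *= likelihood[y][i].get(v, 0.0)  # naive independence given y
        scores[y] = score
    return max(scores, key=scores.get)

# Toy training set: (length, casing) features for hypothetical emails
X = [("long", "caps"), ("short", "caps"), ("long", "lower"), ("short", "lower")]
Y = ["spam", "spam", "ham", "ham"]
prior, likelihood = fit_naive_bayes(X, Y)
# predict(prior, likelihood, ("long", "caps")) -> "spam"
```

In practice one would add smoothing for unseen feature values and work in log space, but the sketch shows where the prior enters: it is the starting score for each class, before any feature of $x$ is examined.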


As an aside, for generative models, we are interested in the joint distribution: $$ p(x,z)=p(z|x)p(x)=p(x|z)p(z) $$ where $z$ is a latent form of the data. Here the posterior is also $p(z|x)$, but the prior $p(z)$ over the latent variables is something that we choose. For (naive) Bayes, the prior can be estimated from the data.