Meaning of the likelihood in Bayes' theorem, and Bayes' theorem for model selection


I am trying to understand Bayes' Theorem when applied in Machine Learning.

Basically, Bayes' Theorem is given by the formula:

$$ P(A|B) = \frac{P(B|A)\,P(A)}{P(B)} $$

I understand how the theorem is derived (from the two ways of factoring the joint probability P(A&B) into conditional probabilities), and I roughly understand the Bayesian interpretation as well. P(A) is the prior belief in the probability of event A occurring (before any evidence). P(A|B) is the posterior probability of our belief that event A happened, knowing that event B has occurred.

Now the first confusion arises when we say that:

P(B|A) is called the likelihood. What does that even mean? How did P(B|A) become a likelihood (why isn't it a probability)?

Here is a concrete example from a book that I am reading:

We have a red box and a blue box, each containing apples and oranges. The probability of choosing the blue box is 60% and of choosing the red box is 40%. The blue box contains 3 apples and 1 orange, while the red box contains 6 oranges and 2 apples. Furthermore, within a box, every fruit is equally likely to be chosen.

[Figure: problem visualized]

Now we want to know the probability of having chosen the red box given that the fruit was an orange, i.e. $P(B=r|F=o)$.

We can calculate that simply by Bayes' theorem:

$$P(B=r|F=o) = \frac{P(F=o|B=r)*P(B=r)}{P(F=o)}$$

Here is my understanding of the equation:

Our prior belief that we would choose the red box was P(B=r), which is 40%; once we get the evidence that we picked an orange, our posterior belief becomes P(B=r|F=o).

Now P(F=o|B=r) is the likelihood. What does it mean in this context? Isn't it just the probability of getting an orange given that we picked the red box? I guess my question becomes: what does likelihood mean for a discrete distribution, and when do we decide that it is a likelihood instead of a probability?
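To make the numbers concrete, here is a quick sanity check of the box example I wrote up (just the arithmetic from the problem statement, nothing more):

```python
# Priors over boxes, P(B), as given in the problem.
p_box = {"red": 0.4, "blue": 0.6}

# Fruit counts per box; fruits within a box are equally likely.
fruits = {"red": {"apple": 2, "orange": 6},
          "blue": {"apple": 3, "orange": 1}}

def p_orange_given(box):
    """Likelihood P(F=orange | B=box): fraction of oranges in the box."""
    counts = fruits[box]
    return counts["orange"] / sum(counts.values())

# Evidence P(F=orange) by the law of total probability.
p_orange = sum(p_orange_given(b) * p_box[b] for b in p_box)

# Posterior P(B=red | F=orange) via Bayes' theorem.
posterior_red = p_orange_given("red") * p_box["red"] / p_orange

print(p_orange_given("red"))  # ≈ 0.75
print(p_orange)               # ≈ 0.45
print(posterior_red)          # ≈ 0.667
```

So the evidence of an orange pushes the belief in the red box up from 40% to about 67%.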

I have searched online for this specific question, but none of the answers I found were satisfying. They all seem to just give a formal definition of the equation and what each term represents.

This leads me to another question: how is this related (if at all) to Bayes' theorem as used for choosing the best model?

What I mean is that we have:

$$ P(m|D) = \frac{P(D|m)*P(m)}{P(D)} $$

Where m is a model and D is a data set. What confuses me is the term P(m): what does that even mean? The probability of a model? How can we quantify the probability of a model? Similarly for P(D): how can we quantify a probability for a data set? Is it the probability of observing such a data set?
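To try to make sense of P(m) myself, I wrote this toy sketch (my own construction, not from the book): two hypothetical models of a coin's bias, a prior belief over the two models, and the posterior P(m|D) after observing some flips.

```python
from math import prod

# Two hypothetical models of a coin's heads probability (assumed for illustration).
models = {"fair": 0.5, "biased": 0.8}

# P(m): prior belief over models, before seeing any data.
prior = {"fair": 0.5, "biased": 0.5}

# Observed data set D: a sequence of flips (1 = heads, 0 = tails).
D = [1, 1, 0, 1, 1, 1]

def likelihood(m):
    """P(D | m): probability of the whole data set under model m."""
    theta = models[m]
    return prod(theta if x == 1 else 1 - theta for x in D)

# P(D): evidence, summing over all candidate models.
p_D = sum(likelihood(m) * prior[m] for m in models)

# P(m | D): posterior belief in each model after seeing the data.
posterior = {m: likelihood(m) * prior[m] / p_D for m in models}
print(posterior)  # the "biased" model ends up more probable
```

If this sketch is right, then P(m) is just a belief assigned to each candidate model before seeing data, and P(D) is the probability of the observed data set averaged over all candidate models, exactly like P(F=o) was averaged over the two boxes.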

I am trying to relate this logic to the concrete example above. Are these two concepts related? And if they are, does that mean that the set of boxes represents a model, and the fruits (oranges and apples) the data set?

I am sorry for the long question; I have been trying to understand Bayes' theorem, and these questions keep popping up in my head.