Is Bayes' Theorem really that interesting?


I have trouble understanding the massive importance that is afforded to Bayes' theorem in undergraduate courses in probability and popular science.

From the purely mathematical point of view, I think it would be uncontroversial to say that Bayes' theorem does not amount to a particularly sophisticated result. Indeed, the relation $$P(A|B)=\frac{P(A\cap B)}{P(B)}=\frac{P(B|A)P(A)}{P(B)}$$ has a one-line proof: expand $P(A\cap B)=P(B|A)P(A)$ directly from the definition of conditional probability. Thus, I expect that what people find interesting about Bayes' theorem has to do with its practical applications or implications. However, even in those cases I find the typical examples used to justify this to be a bit artificial.


To illustrate this, the classical application of Bayes' theorem usually goes something like this: Suppose that

  1. 1% of women have breast cancer;
  2. 80% of mammograms are positive when breast cancer is present; and
  3. 10% of mammograms are positive when breast cancer is not present.

If a woman has a positive mammogram, then what is the probability that she has breast cancer?

I understand that Bayes' theorem allows us to compute the desired probability from the given information, and that this probability is counterintuitively low. However, I can't help but feel that the premise of this question is wholly artificial. The only reason we need Bayes' theorem here is that the full information from which the other probabilities (i.e., 1% have cancer, 80% true positive, etc.) were computed is not provided to us. If we have access to the sample data from which these probabilities were computed, then we can directly find $$P(\text{cancer}|\text{positive test})=\frac{\text{number of women with cancer and positive test}}{\text{number of women with positive test}}.$$ In mathematical terms, if you know how to compute $P(B|A)$, $P(A)$, and $P(B)$, then you already know how to compute $P(A\cap B)=P(B|A)P(A)$ and $P(B)$, in which case you already have your answer.
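To make the comparison concrete, here is a quick sketch (using the rates stated above, plus a hypothetical cohort of 10,000 women consistent with them) showing that the theorem and the direct count give the same number:

```python
# Posterior probability of cancer given a positive mammogram,
# using the numbers from the question above.
p_cancer = 0.01        # prior: 1% of women have breast cancer
p_pos_cancer = 0.80    # P(positive | cancer)
p_pos_healthy = 0.10   # P(positive | no cancer)

# Bayes' theorem: P(cancer | positive) = P(positive | cancer) P(cancer) / P(positive)
p_pos = p_pos_cancer * p_cancer + p_pos_healthy * (1 - p_cancer)
posterior = p_pos_cancer * p_cancer / p_pos

# The same answer from hypothetical raw counts, as in the objection above:
# of 10,000 women, 100 have cancer (80 test positive)
# and 9,900 do not (990 test positive).
posterior_from_counts = 80 / (80 + 990)

print(round(posterior, 4))              # ~0.0748
print(round(posterior_from_counts, 4))  # ~0.0748
```

Both routes land on roughly $7.5\%$, which is exactly the point: the theorem is just the counting argument with the counts normalized away.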


From the above arguments, it seems to me that Bayes' theorem is essentially only useful for the following reasons:

  1. In an adversarial context, i.e., someone who has access to the data only tells you about $P(B|A)$ when $P(A|B)$ is actually the quantity that is relevant to your interests, hoping that you will get confused and will not notice.
  2. An opportunity to dispel the confusion between $P(A|B)$ and $P(B|A)$ with concrete examples, and to explain that these are very different when the ratio between $P(A)$ and $P(B)$ deviates significantly from one.

Am I missing something big about the usefulness of Bayes' theorem? In light of point 2., especially, I don't understand why Bayes' theorem stands out so much compared to, say, the Borel-Kolmogorov paradox, or the "paradox" that $P[X=x]=0$ when $X$ is a continuous random variable, etc.

8

There are 8 best solutions below

19
On BEST ANSWER

You are mistaken in thinking that what you perceive as "the massive importance that is afforded to Bayes' theorem in undergraduate courses in probability and popular science" really is importance afforded to Bayes' theorem itself. But it's probably not your fault: this usually doesn't get explained very well.

What is the probability of a Caucasian American having brown eyes? What does that question mean? By one interpretation, commonly called the frequentist interpretation of probability, it asks merely for the proportion of persons having brown eyes among Caucasian Americans.

What is the probability that there was life on Mars two billion years ago? What does that question mean? It has no answer according to the frequentist interpretation. "The probability of life on Mars two billion years ago is $0.54$" is taken to be meaningless because one cannot say it happened in $54\%$ of all instances. But the Bayesian, as opposed to frequentist, interpretation of probability works with this sort of thing.

The Bayesian interpretation applied to statistical inference is immune to various pathologies afflicting that field.

Possibly you have seen that some people attach massive importance to the Bayesian interpretation of probability and mistakenly thought it was merely massive importance attached to Bayes's theorem. People who do consider Bayesianism important seldom explain this very clearly, primarily because that sort of exposition is not what they care about.

18
On

First see the comments following this answer, especially the last few. I was totally unaware that Bayes' theorem is simply a consequence of the definition of conditional probability. Given that, I can't refute the idea that the following problem can be solved without Bayes' theorem.


Hard to imagine attacking a conditional probability problem without it. Imagine traveling back in time 1000 years. You are the captain of a ship. You have two sailors, A and B, whom you independently use to predict rain.

A is right 90% of the time and B is right 80% of the time.
A says it will rain today, and B says it won't rain today.
Absent Bayes' theorem, and absent any info on how often (in general) it rains, how do you (intuitively) determine the chance that it will rain today? Clearly, the problem is well defined, so it has a meaningful answer. Absent Bayes' theorem, or anything like it, how do you compute the answer?
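One way to see what Bayes' theorem buys you here is a short sketch (my own; the prior probability of rain $r$ is an assumption, since the problem above doesn't supply one, and the sailors are assumed to err independently):

```python
# P(rain | A says "rain", B says "no rain"), given a prior P(rain) = r.
# A is right 90% of the time, B is right 80% of the time, errors independent.
def p_rain(r, acc_a=0.9, acc_b=0.8):
    like_rain = acc_a * (1 - acc_b)   # if it rains: A was right, B was wrong
    like_dry = (1 - acc_a) * acc_b    # if it's dry: A was wrong, B was right
    num = r * like_rain
    return num / (num + (1 - r) * like_dry)

for r in (0.1, 0.5, 0.9):
    print(r, round(p_rain(r), 3))
```

Running this for several priors (it gives $0.2$ at $r=0.1$ but about $0.69$ at $r=0.5$) also illustrates why the base rate the answer above sets aside actually matters: without some prior, the posterior is not pinned down.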

5
On

There are two main issues here. One is that on a Bayesian interpretation of probability (this term doesn't reference the theorem, but they're both named for Bayes), probability quantifies how well we know individual events, not detailed available frequency statistics. The best-of-both-worlds hope, if you combine Bayesian and frequentist perspectives, is that past data give us the mammogram values you cited, and an individual woman can be diagnosed based on Bayes's theorem.

The second issue is that $P(A|B)$ need not be remotely close to $P(B|A)$. To wit:

  • A test that's usually right may still have most of its positives be false, which warrants some scepticism, as well as further testing.
  • Conflating $P(A|B)$ with $P(B|A)$ is a danger in the legal system. Will we arrest people based on accuracy, precision etc., even if their guilt is unlikely? Will "this evidence is unlikely if they're innocent" get them convicted, even though it may not mean their innocence is unlikely? And yes, this has had real-world fallout in both policing and court decisions.
  • Statistics tests what probability assumes (e.g. "if this is Gaussian then..."). Statistical tests often boil down to: "we can't measure the probability that the null hypothesis is true, but we'll assess it based on the probability, under the null hypothesis, that data at least this surprising would occur". Indeed, which statement gets to be the null hypothesis is more about its facilitating such calculations than its being a "default" or "reasonable" assumption.
2
On

While I agree with Michael Hardy's answer, there is a sense in which Bayes' theorem is more important than any random identity in basic probability. Write Bayes' Theorem as

$$\text{P(Hypothesis|Data)}=\frac{\text{P(Data|Hypothesis)P(Hypothesis)}}{\text{P(Data)}}$$

The left hand side is what we usually want to know: given what we've observed, what should our beliefs about the world be? But the main thing that probability theory gives us is in the numerator on the right side: the frequency with which any given hypothesis will generate particular kinds of data. Probabilistic models in some sense answer the wrong question, and Bayes' theorem tells us how to combine this with our prior knowledge to generate the answer to the right question.

Frequentist methods that try not to use the prior have to reason about the quantity on the left by indirect means or else claim the left side is meaningless in many applications. They work, but frequently confuse even professional scientists. E.g. the common misconceptions about $p$-values come from people assuming that they are a left-side quantity when they are a right-side quantity.

0
On

Let me start with a memory. From my undergraduate days, 30 years ago, I vividly remember the time when Bayes was introduced. We had spent a lot of time and effort on sampling theory and on how to know whether things could be established. And to me, at the time, it always came down to needing a sample size of x (my recollection is that a sample size of 7 was often the minimum).

To me Bayes represented a totally different approach, one more in alignment with my view of reality. In sampling we looked at groups; with Bayes we started with individual things. So for me this was a very eye-opening addition to the field of probability praxis (and theory of course, but that came later for me). The book we had, written by Raiffa I believe, was about decision theory. 30 years later I still remember the discussion about whether to do one more test drilling in the oil field.

So, just maybe, in your curriculum the importance placed on Bayes is there to show that statistics has several different branches, not only sampling theory or how to present graphs as accurately as possible.

0
On

You might know only $\Pr[A\mid B]$ and not $\Pr[B\mid A]$, not because someone "adversarially told you the wrong one", but because one of those is a natural quantity to compute, and the other is a natural quantity to want to know.

I am about to teach Bayes' theorem in an undergraduate course in probability. The general setting I want to consider is when:

  • We have several competing hypotheses about the world. (Several candidates for $B$.)
  • If we assume one of these hypotheses, then we get a nice and easy probability problem where it's easy to find the probability of $A$: some observations that we've made. (Outside undergraduate probability courses, "nice and easy" is a relative term.)
  • We want to figure out which hypothesis is likelier.

The mammogram example may be natural, but it is less obviously so, because we have to track down where the numbers given to us come from, and ask why we couldn't have been given the other quantities in the problem instead. So here are some examples where fewer numbers come to us out of thin air.

  1. Suppose you are communicating over a binary channel which flips bits $10\%$ of the time. (This part is given to us out of nowhere, but it's the natural quantity to ask about first.) Your friend has several possible messages they might send you: these are the hypotheses $B_1, B_2, \dots, B_n$. You receive a message: that's the observation $A$. Then $\Pr[A \mid B_i]$ is just $(0.1)^k (0.9)^{n-k}$ if $B_i$ is an $n$-bit message that differs from the one you received in $k$ places. On the other hand, $\Pr[B_i \mid A]$ is the quantity we want: it will tell us how likely it is that your friend sent each message.
  2. You have a coin, and you don't know anything about its fairness. One possible assumption is that it lands heads with probability $p$, where $p \sim \text{Uniform}(0,1)$, but we could vary this. Then you flip the coin $n$ times and see $k$ heads. There are infinitely many hypotheses $B_p$, one for each possible $p$; under each of them, $\Pr[A \mid B_p]$ is just a binomial probability. Knowing the conditional PDF of $p$, which is what Bayes' theorem tells us, tells us more about how likely the coin is to land heads.
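Example 2 can be made numerical. Under the uniform prior, the posterior density of $p$ is the $\text{Beta}(k+1, n-k+1)$ distribution (a standard conjugacy fact), and a simple grid computation of Bayes' theorem recovers it; the specific $n$ and $k$ below are just illustrative:

```python
# Posterior over the heads-probability p after seeing k heads in n flips,
# computed on a grid from Bayes' theorem with a Uniform(0,1) prior.
# (Standard result: the exact posterior is Beta(k+1, n-k+1).)
from math import comb

n, k = 10, 7
grid = [i / 1000 for i in range(1001)]
likelihood = [comb(n, k) * p**k * (1 - p)**(n - k) for p in grid]

# Normalize: divide by a Riemann-sum estimate of P(data) = integral of the likelihood.
norm = sum(likelihood) / len(grid)
posterior = [l / norm for l in likelihood]

# Posterior mean of p: the Bayesian point estimate of the coin's bias.
post_mean = sum(p * d for p, d in zip(grid, posterior)) / len(grid)
print(round(post_mean, 3))   # close to the exact (k+1)/(n+2) = 8/12
```

The grid approach is deliberately crude, but it shows that nothing beyond the binomial likelihoods $\Pr[A \mid B_p]$ and the prior is needed to answer the question we actually care about.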
3
On

You are correct that Bayes' theorem follows trivially from axioms of probability that everyone accepts. The difference between Bayesians and frequentists is a cultural one. The actual mathematical axioms they subscribe to are trivially homologous.

The cultural divide is a pretty stark one though.

  • Frequentists tend to think computation is a dirty word and don't care to analyse problems that they cannot approach analytically, so basically they would prefer to think that everything is a Gaussian. Also, some of them tend to do this funny numerology thing where they fetishise numbers like 0.01 and 0.05.

  • Bayesians think that if they write down a uniform prior as a formula it looks more like real mathematics and less like a stupid assumption that rarely applies (appeals to 'entropy' make them feel great too); and they delude themselves into thinking that labelling part of their likelihood function a prior makes them special; as if frequentists couldn't multiply different likelihood functions together to get a joint one just fine.

Actual examples where a non-strawman version of either approach to the same problem yields a different result do not actually exist, because there are no differences in the fundamental axioms they subscribe to. That said, it is not as if the language, computational tools, and modelling approaches you use are unimportant in guiding your thought process. It'd be better if teaching methods focussed more on said homology, though.

0
On

Not exactly an answer to the posted question, but Bayesian ideology is important in many practical problems in artificial intelligence, including character recognition, medical diagnosis, and more, the key structure being a Bayesian inference network.