Bayes' Theorem for a Disease Testing Problem


There are two tests for a disease: one rapid and one slow. Given that an individual is infected, the rapid test will register positive $40\%$ of the time, while the slow test will register positive $80\%$ of the time; additionally, both tests will be positive $35\%$ of the time.

Suppose in the above example that people not infected always test negative for both tests.

Of the people in the population who are tested, $75\%$ of their results from the slow test are positive. What is the chance that a person has the virus conditioned on getting a negative result on the slow test?

I tried this

However, I took $P(V) = 20\%$ from a different part of the problem that is apparently separate, so my answer is incorrect.


There is 1 best solution below


Whenever you write an answer, always define your notation. This includes writing a complete and unambiguous description of what your variables mean. I look at your work and my first reaction is, "what are U, L, and S supposed to represent?"

For this part of the question, the rapid test characteristics are irrelevant. We define $V$ to mean that a randomly selected person is infected with the virus. $S$ indicates that a randomly selected person tests positive with the slow test.

For clarity, I will denote by $\bar S$ the complementary event that the person tests negative on the slow test. Similarly, $\bar V$ indicates a person is not infected.

Then we are given the following information:

$$\Pr[S \mid \bar V] = 0.$$ This means an uninfected person never tests positive.

$$\Pr[S] = 0.75.$$ This means a randomly selected person from the population will test positive on the slow test with probability $0.75$.

$$\Pr[S \mid V] = 0.8.$$ This means that an infected person tests positive on the slow test with probability $0.8$.

The desired quantity is $$\Pr[V \mid \bar S] = \frac{\Pr[\bar S \mid V]\Pr[V]}{\Pr[\bar S]}.$$ We know $$\Pr[\bar S] = 1 - \Pr[S] = 1 - 0.75 = 0.25.$$ Similarly, we know $$\Pr[\bar S \mid V] = 1 - \Pr[S \mid V] = 1 - 0.8 = 0.2.$$ But we do not know $\Pr[V]$, the unconditional probability that a random person is infected (i.e., the prevalence of disease in the population). It doesn't seem obvious how to proceed. So what we do is set up a hypothetical population and fill in the contingency table: say we have $10000$ people in the population. Then

$$\begin{array}{c|c|c|c} & V & \bar V & \text{Total}\\ \hline S & & & \\ \hline \bar S & & & \\ \hline \text{Total} & & & 10000 \end{array}$$

Then $\Pr[S] = 0.75$ means we fill in row $2$, column $4$ with $7500$, and row $3$, column $4$ with $2500$. We also note $\Pr[S \mid \bar V] = 0$ means that row $2$, column $3$ must contain $0$:

$$\begin{array}{c|c|c|c} & V & \bar V & \text{Total}\\ \hline S & & 0 & 7500 \\ \hline \bar S & & & 2500 \\ \hline \text{Total} & & & 10000 \end{array}$$ This clearly implies row $2$ column $2$ contains $7500$. So what must be the total in column $2$ such that $\Pr[S \mid V] = 0.8$? Clearly it needs to be $7500/0.8 = 9375$:

$$\begin{array}{c|c|c|c} & V & \bar V & \text{Total}\\ \hline S & 7500 & 0 & 7500 \\ \hline \bar S & & & 2500 \\ \hline \text{Total} & 9375 & & 10000 \end{array}$$ And the rest is just subtraction:

$$\begin{array}{c|c|c|c} & V & \bar V & \text{Total}\\ \hline S & 7500 & 0 & 7500 \\ \hline \bar S & 1875 & 625 & 2500 \\ \hline \text{Total} & 9375 & 625 & 10000 \end{array}$$

This gives us $\Pr[V] = 9375/10000 = 0.9375$ and the rest of the problem is arithmetic.
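The table-filling steps above can also be checked numerically. The following is a minimal sketch, not part of the original answer: the function name `fill_table` and the dictionary keys are made up for illustration, and it hard-codes the assumption that uninfected people never test positive.

```python
from fractions import Fraction


def fill_table(population, p_pos, sensitivity):
    """Fill the 2x2 table of counts for the slow test.

    Assumes (as in the problem) that uninfected people never test
    positive, i.e. Pr[S | not V] = 0.
    """
    pos_total = population * p_pos               # row S, Total column
    neg_total = population - pos_total           # row S-bar, Total column
    pos_infected = pos_total                     # every positive is infected
    infected_total = pos_infected / sensitivity  # column V total
    neg_infected = infected_total - pos_infected
    uninfected_total = population - infected_total
    return {
        "pos_infected": pos_infected,
        "pos_uninfected": 0,
        "pos_total": pos_total,
        "neg_infected": neg_infected,
        "neg_uninfected": uninfected_total,      # all uninfected test negative
        "neg_total": neg_total,
        "infected_total": infected_total,
        "uninfected_total": uninfected_total,
    }


# Exact fractions avoid floating-point noise in the division by 0.8.
table = fill_table(10_000, Fraction(3, 4), Fraction(4, 5))
prevalence = table["infected_total"] / 10_000          # Pr[V]
posterior = table["neg_infected"] / table["neg_total"]  # Pr[V | S-bar]

print(float(prevalence), float(posterior))
```

Reading the answer straight off the table as (infected and negative) / (all negative) is the count-based equivalent of applying Bayes' theorem.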

While the contingency table is useful, it lacks formality. What is the underlying mathematical reasoning that justifies the result? Following the logic we used, it becomes clear that $\Pr[S \mid \bar V] = 0$ implies $\Pr[S \cap \bar V] = \Pr[S \mid \bar V]\Pr[\bar V] = 0$, hence $$\Pr[S \cap V] = \Pr[S] - \Pr[S \cap \bar V] = \Pr[S] = 0.75.$$ Then because $\Pr[S \cap V] = \Pr[S \mid V]\Pr[V]$, we have $$\Pr[V] = \frac{\Pr[S \cap V]}{\Pr[S \mid V]} = \frac{0.75}{0.8} = 0.9375.$$ This is why the table is a powerful tool: it helps us see how to fill in the missing information, and then formalize those steps using probability notation.
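For completeness, here is the remaining arithmetic written out as a worked check. This step is not spelled out in the answer itself; it only substitutes the values derived above into Bayes' formula: $$\Pr[V \mid \bar S] = \frac{\Pr[\bar S \mid V]\Pr[V]}{\Pr[\bar S]} = \frac{(0.2)(0.9375)}{0.25} = \frac{0.1875}{0.25} = 0.75.$$ This agrees with the ratio $1875/2500$ read from the $\bar S$ row of the completed table.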