Simulating the Behavior of a Screening Test for a Disease

1k Views Asked by At

A test for a certain rare disease is assumed to be correct 95% of the time: if a person has disease, the test results are positive with probability 0.95, and if the person does not have the disease, the test results are negative with probability 0.95. A random person drawn from a certain population has probability 0.001 of having the disease.

Given that the person just tested positive, what is the probability of having the disease?

Just pseudocode would be great

3

There are 3 best solutions below

0
On

Try something like this:

  • Starting with $1000000$ people, how many do you expect to have the disease and how many not?
  • Of those with the disease, how many do you expect to test positive (true positive) and how many negative (false negative)?
  • Of those not with the disease, how many do you expect to test positive (false positive) and how many negative (true negative)?
  • Of those testing positive, what proportion do you expect to be true positives?

You can easily simulate the first three steps and then calculate the fourth, though like any simulation you should not expect the calculation to be particularly accurate, especially of a very rare disease.

0
On

If you want a simulation, try following procedure:

  1. Initialize TP - true positive people counter and FP - false positive people counter with zeros;
  2. Repeat the following N times (a total number of people simulated):

    2.1. Roll random variable assuming $1$ with probability $0.001$ (observed sick person) and $0$ with probability $0.999$ (observed healthy person);

    2.2. If in previous step you got $1$, with probability $0.95$ increase TP by one (sick person tested positive); otherwise with probability $0.05$ increase FP by one (healthy person tested positive);

  3. Probability to be sick if tested positive is ${TP}/(TP + FP)$.

Given that you have rather low probability of being sick in the first place, you would need big enough N to converge to mean values (probability of being sick given tested positive is $0.018664\ldots$).

0
On

Please read my Comment first:

Code in R. Using ten million iterations gives about three-place accuracy. Both of the vectors dis and pos are of length m. Brackets [ ] can be read 'such that'. Assignment is =; logical equality is ==; mean of a logical vector is the proportion of its TRUEs. Structure avoids explicit loops and 'if...then' branching. Retain set.seed statement for exactly the same simulation, omit for unpredictable seed and fresh simulation.

answ = (.001*.95)/(.001*.95 + .999*.05);  answ
[1] 0.01866405                                   # exact P(Dis|Pos)
set.seed(627)
m = 10^7;  pos = numeric(m)
dis = rbinom(m, 1, .001);  nr.dis = sum(dis==1)  # 1=disease, 0=none
  pos[dis==1] = rbinom(nr.dis, 1, .95)             # 1=pos test ...
  pos[dis==0] = rbinom(m-nr.dis, 1, .05)           #  ... 0 = neg test
mean(dis[pos==1]==1)                             # aprx P(Dis|Pos)
[1]  0.01879157

In the language of Bayesian inference and screening tests, the 'prevalence' 0.001 of the disease is a 'prior probability', and the 'predictive value of a positive test' $P(\text{Disease | Positive})$ is a 'posterior probability'. Here, although PVP $\approx 2\%$ is small, it is relatively much larger than prevalence $0.1\%$.