Suppose in a finite population (of size $n$), all configuration of individuals infected with the Corona virus (numbering $2^n$) are equally likely.
Consider the coronavirus testing procedure where there is a finite number of testing rounds. In each round, the population is grouped into not necessarily disjoint subsets. The swabs of all individuals in each subset are collected into one test tube and tested collectively for the presence of infection in that subset in the form of a binary result, i.e. either positive ($+$) or negative ($-$). There may be a subset not tested. The testing round terminates once all infected individuals are identified.
What is the design of the above rounds and the associated grouping in each round so that on average the total number of tests is the least?
I ask for the mean since the least of number of tests for the worst case is always the size of the population.
Here are two example procedures. Suppose the population size is $4$. Represent the individuals by the binary numbers $\{00,01,10,11\}$.
- In round 1, we make $3$ groups. The first consists of the individual $00$ alone. The second group consists of individuals whose first from the left digit is $1$ and the third of those whose second from the left digit is $1$. The swabs of all individuals are put into one test tube and tested once. The infection status of $00$ is at once settled. If group $2$ is $+$ and group $3$ is $-$. We know only $10$ is infected amongst $\{01,10,11\}$. Similarly for the case of group $2$ being $-$ and group $3$ being $+$. If both are $-$, the test is terminated. If both are $+$, we have to relabel the individuals $\{01,10,11\}$ by $\{00,01,10\}$ and proceed to round $2$.
In round 2, we proceed the same way as before. In effect, we test the three individuals one by one.
The whole test procedure then terminates.
- Another way simple procedure is to build a binary tree partition according to the binary representation. Round 1. Test the whole population. Round 2. Partition the population into two disjoint groups. The first consists of the individuals whose first left digit is $0$ whereas the second of those whose first left digit is $1$. In round 3., further partition each previous group according to whether the second left digit is $0$ or $1$.
Given the uniform distribution of the results, the optimal procedure is to test each individual individually.
In short, the reason is that to gain maximal information, a test should have probability $\frac12$ to yield either result, and if each individual independently has probability $\frac12$ of being infected, then individual tests are the only ones with this property.
We can get rid of the complication of the rounds, since any test schedule with rounds of tests combined could just as well be executed one test at a time.
More generally, say you want to distinguish between $m$ different outcomes (in your case $m=2^n$), and you can perform tests that tell you whether the outcome is in some arbitrary subset with $k$ of the $m$ outcomes. (This is more powerful than your tests, which only allow for some subsets of the $2^n$ outcomes.) We can prove by strong induction over $m$ that you need at least $\log_2m$ (in your case $n$) tests on average to identify the outcome.
For the base case $m=1$, this is clearly correct: $\log_21=0$, and we need $0$ tests to distinguish $1$ outcome.
Assume that it’s correct for all values less than $m$. If you run a test with a subset for $k$ of the $m$ outcomes, then with probability $\frac km$ you’ll be left with $k$ outcomes to distinguish and with probability $\frac{m-k}m$ you’ll be left with $m-k$ outcomes to distinguish, so by the induction hypothesis the expected number of tests required is at least
$$ 1+\frac km\log_2k+\frac{m-k}m\log_2(m-k)=\log_2m+1+\frac km\log_2\frac km+\frac{m-k}m\log_2\frac{m-k}m\;. $$
The last two terms are the negative binary entropy of the Bernoulli distribution with probability $p=\frac km$, which takes its minimal value $-1$ for $p=\frac12$. It follows that you can’t do better than $\log_2m$, and that this optimal average is achieved only if each test distinguishes exactly half of the remaining outcomes.
On the other hand, if the individuals have an independent probability $p\lt\frac12$ of being infected, then group testing may be beneficial. For instance, for two individuals, if you test them individually you are certain to require $2$ tests, whereas if you first test them as a group, then $A$ if necessary and then $B$ if necessary, you need $1$ test if neither is infected, $2$ tests if $A$ is not infected but $B$ is, and $3$ tests if $A$ is infected, for an average of
$$ (1-p)^2\cdot1+(1-p)p\cdot2+p\cdot3=1+3p-p^2\;, $$
which is less than $2$ for $p\lt\frac{3-\sqrt5}2\approx38\%$.