Sufficient statistic and the maximum likelihood estimator of the probability of having an infectious disease when people are grouped and tested


Suppose N students arriving at a college are all equally likely to have a particular disease with an unknown probability p. The disease statuses (affected / not affected) of the students are independent. Blood samples are collected from all N students. In order to estimate p, two strategies are proposed.

Strategy 1 Test all samples separately to obtain the status for all N students.

Strategy 2 Randomly partition the students into m disjoint groups, each comprising K = N/m students (with K ≥ 2 an integer). For each group, pool (mix) the blood samples from all K students within the group and test the pooled sample. If the pooled sample tests positive, then at least one student within that group is affected. If the pooled sample tests negative, all students within that group are unaffected.

(a) Based on the data obtained from Strategy 2, find a real valued sufficient statistic for p and the maximum likelihood estimator of p.

(b) If a group tests positive, then all students within that group are further tested individually. Suppose that each test (for individual sample or pooled sample) has equal cost. Then, which of the two strategies would you prefer to identify all students affected with the disease when the underlying p=0.5, N=200, and m = 20?

For part (b), each group has K = 10 people, and the expected number of tests for a group under Strategy 2 is $1\cdot(1-0.5)^{10} + 11\cdot\left(1- (1-0.5)^{10}\right)=11-10\cdot 0.5^{10}$. Letting the cost of one test be c, the expected cost under Strategy 2 is $20c(11-10\cdot 0.5^{10}) \approx 219.8c$, while under Strategy 1 it is $200c$, which is less. So I'd prefer Strategy 1.
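As a sanity check of the arithmetic above, here is a short Python sketch (the variable names are my own) that computes the expected number of tests under both strategies:

```python
p, N, m = 0.5, 200, 20
K = N // m  # 10 students per group

# Strategy 2: each group needs 1 pooled test, plus K individual tests
# if the pooled sample is positive (probability 1 - (1-p)^K).
exp_tests_per_group = 1 + K * (1 - (1 - p) ** K)
exp_tests_strategy2 = m * exp_tests_per_group  # = 20*(11 - 10*0.5**10)

# Strategy 1: one test per student.
exp_tests_strategy1 = N

print(exp_tests_strategy2)  # 219.8046875
print(exp_tests_strategy1)  # 200
```

The pooled strategy expects about 219.8 tests versus 200, confirming that Strategy 1 is cheaper at p = 0.5.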

Can someone please help with part (a) and also check my solution for part (b)? I have no idea how to come up with a sufficient statistic here.

Best Answer

For each $1\leq j \leq m$ take $X_j=1$ if the blood from group $j$ tests positive and $X_j=0$ otherwise. Then $X_j \sim \text{Bernoulli}\left(1-(1-p)^K\right)$.

For fixed $x_1,\dots,x_m\in \{0,1\}$, notice $$\mathbb{P}(X_1=x_1,\dots,X_m=x_m)=\left[1-(1-p)^K\right]^{\sum_{j=1}^m x_j}\left[(1-p)^K\right]^{m-\sum_{j=1}^mx_j}$$

Because the joint distribution of the random vector $(X_1,\dots,X_m)$ depends on the data only through $\sum_{j=1}^mX_j$, the factorization theorem shows that $T(X_1,\dots,X_m)=\sum_{j=1}^mX_j$ is sufficient for $p$. The sample mean $\bar{x}=\frac{1}{m}\sum_{j=1}^{m}x_j$ is the MLE of the Bernoulli parameter $\theta=1-(1-p)^K$, so by the invariance property of the MLE, $\mathbb{P}(X_1=x_1,\dots,X_m=x_m)$ is maximized at $p=\hat{p}$ where $\hat{p}=1-\left(1-\frac{1}{m}\sum_{j=1}^{m}x_{j}\right)^{\frac{1}{K}}$. This makes $\hat{p}$ your sought-after MLE.
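A quick simulation (parameters and variable names are my own choices) can illustrate that this estimator recovers $p$: generate pooled-test indicators $X_j$ and plug the group-level sample mean into $\hat{p}=1-(1-\bar{x})^{1/K}$.

```python
import random

random.seed(1)
p_true, m, K = 0.3, 5000, 4  # true p, number of groups, group size

# X_j = 1 iff group j's pooled sample is positive, i.e. at least one
# of its K students is affected; P(X_j = 1) = 1 - (1 - p)^K.
X = [int(any(random.random() < p_true for _ in range(K))) for _ in range(m)]

xbar = sum(X) / m                  # MLE of theta = 1 - (1-p)^K
p_hat = 1 - (1 - xbar) ** (1 / K)  # invariance: MLE of p

print(p_hat)  # should be close to 0.3
```

With many groups the estimate lands near the true value, consistent with the MLE formula above.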

The number of tests performed in part $(b)$ is the random variable $$m+K(X_1+\dots +X_m)$$ whose expected value is $$m+Km\left[1-(1-p)^K\right]=m+N \left[1-(1-p)^{K}\right]$$ This agrees with your answer after taking $p=1/2,N=200,$ and $m=20$.
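To close the loop, a small Monte Carlo sketch (an assumed setup of my own) can check that the random test count $m+K(X_1+\dots+X_m)$ has the expected value claimed above, about 219.8 at $p=1/2$, $N=200$, $m=20$:

```python
import random

random.seed(0)
p, N, m = 0.5, 200, 20
K = N // m

def tests_needed():
    """Total tests for one realization: 1 pooled test per group,
    plus K individual tests for each group that tests positive."""
    total = 0
    for _ in range(m):
        group_positive = any(random.random() < p for _ in range(K))
        total += 1 + (K if group_positive else 0)
    return total

runs = 5000
avg = sum(tests_needed() for _ in range(runs)) / runs
print(avg)  # should be near 219.80
```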