Confidence interval of the number of mistakes in a document

Question

40 Views Asked by Bumbble Comm At 25 Feb 2026 - 8:05

There is a document containing $n$ typing mistakes, where $n$ is an unknown constant. To investigate the value of $n$, $m$ more typing mistakes, where $m$ is given, are deliberately introduced in the document and the document is reviewed by a checker. For each mistake, the probability that the mistake is found by the checker is $p$, where $p$ is an unknown constant. In other words, the number of unknown mistakes found by the checker follows the binomial distribution $B(n,p)$ and the number of known mistakes found by the checker follows the binomial distribution $B(m,p)$.

Suppose the checker finds $x$ unknown mistakes and $y$ known mistakes. Construct an approximate $95\%$ confidence interval for $n$.

Attempt

Using the data of the known mistakes, an approximate $95\%$ confidence interval for $p$ would be $$\left(\frac{y}{m}-z_{0.975}\sqrt{\frac{y(m-y)}{m^3}},\frac{y}{m}+z_{0.975}\sqrt{\frac{y(m-y)}{m^3}}\right)$$

On the other hand, for the data of the unknown mistakes, $x\approx np$ is expected, so $$\left(\frac{x}{\frac{y}{m}+z_{0.975}\sqrt{\frac{y(m-y)}{m^3}}},\frac{x}{\frac{y}{m}-z_{0.975}\sqrt{\frac{y(m-y)}{m^3}}}\right)$$ might be a candidate for an approximate $95\%$ confidence interval for $n$.

Does my guess make sense? Is there a better way to construct the confidence interval? Thanks.

**Bumbble Comm** · Accepted Answer

As a concrete example, suppose there are $M = 300$ known errors (the question's $m$), and that the checker detects $x = 700$ of the $N$ unknown errors (the question's $n$) and $y = 200$ of the $M$ known ones. A crucial assumption is that the two types of typographical errors are equally easy to detect.

Then from the 'known' errors $\hat p = 200/300 = 0.667.$ Using this same $\hat p$ for unknown errors we have the point estimate $\hat N = x/\hat p = 700/0.667 = 1050.$

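Then the Wald CI for $p$ is of the form $\hat p \pm 1.96 \sqrt{\frac{\hat p(1-\hat p)}{M}},$ which computes to $(0.6133,0.7200).$ Dividing $x = 700$ by these endpoints (the upper endpoint for $p$ gives the lower endpoint for $N$) yields the corresponding CI for $N$ (rounded to integers): $(972, 1141).$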
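The arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation described in this answer; the function name `wald_interval_for_n` is made up for illustration, and the guard for a non-positive lower endpoint is an added assumption, not something the answer discusses.

```python
import math

def wald_interval_for_n(x, y, m, z=1.96):
    """Wald CI for p from the known errors, then x divided by its endpoints.

    x: unknown errors found, y: known errors found, m: known errors planted.
    Returns ((p_lo, p_hi), (n_lo, n_hi)).
    """
    p_hat = y / m
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / m)
    p_lo, p_hi = p_hat - half_width, p_hat + half_width
    # Assumption: if the lower endpoint for p is not positive,
    # report an unbounded upper endpoint for N.
    n_hi = x / p_lo if p_lo > 0 else math.inf
    return (p_lo, p_hi), (x / p_hi, n_hi)

(p_lo, p_hi), (n_lo, n_hi) = wald_interval_for_n(x=700, y=200, m=300)
print(f"CI for p: ({p_lo:.4f}, {p_hi:.4f})")
print(f"CI for N: ({n_lo:.0f}, {n_hi:.0f})")
```

Note that this interval reflects only the uncertainty in $\hat p$; it treats $x$ as if it were exactly $Np.$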

The Wald CI for $p$ based on 300 trials should be reasonably accurate. If the number of known errors is much smaller than that, you should use the Agresti-Coull estimate $\tilde p = \frac{y+2}{M+4}$ instead. Then the CI is of the form $\tilde p \pm 1.96\sqrt{\frac{\tilde p(1-\tilde p)}{M+4}}.$ This and other alternative CIs are discussed in this Wikipedia article.
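As a rough sketch of the Agresti-Coull version, the snippet below uses the formula quoted above with a deliberately small number of known errors. The values $M = 20,$ $y = 13,$ $x = 50$ are made up for illustration, and the function name is hypothetical.

```python
import math

def agresti_coull_interval(y, m, z=1.96):
    """Agresti-Coull CI for p: add 2 successes and 2 failures, then use the Wald form."""
    p_tilde = (y + 2) / (m + 4)
    half_width = z * math.sqrt(p_tilde * (1 - p_tilde) / (m + 4))
    return p_tilde - half_width, p_tilde + half_width

# Illustrative values only (not from the question or the answer).
x, y, m = 50, 13, 20
p_lo, p_hi = agresti_coull_interval(y, m)
n_lo, n_hi = x / p_hi, x / p_lo  # larger p gives smaller N
print(f"CI for p: ({p_lo:.3f}, {p_hi:.3f})")
print(f"CI for N: ({n_lo:.0f}, {n_hi:.0f})")
```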

You will find that values of $M < N,$ and especially values of $M$ so small that $y$ may be in single digits (because of small $p$), can result in uselessly long confidence intervals for $N.$ An optimal strategy for estimation might be to choose $M$ near the supposed value of $N,$ if feasible.
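To get a feel for how the interval length depends on $M,$ here is a small sketch that keeps $x = 700$ and $\hat p = 2/3$ fixed (an artificial simplification chosen for illustration) and varies only the number of known errors. The function name and the chosen values of $M$ are assumptions.

```python
import math

def n_interval_width(x, y, m, z=1.96):
    """Length of the interval for N obtained by dividing x by the Wald endpoints for p."""
    p_hat = y / m
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / m)
    p_lo, p_hi = p_hat - half_width, p_hat + half_width
    if p_lo <= 0:
        return math.inf  # assumption: treat the interval as unbounded
    return x / p_lo - x / p_hi

x = 700
for m in (15, 30, 300, 3000):
    y = 2 * m // 3  # keeps the observed detection rate at exactly 2/3
    print(f"M = {m:5d}, y = {y:5d}, width of CI for N = {n_interval_width(x, y, m):.1f}")
```

The widths shrink as $M$ grows, since the standard error of $\hat p$ scales like $1/\sqrt{M}.$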

If this is a class problem in an elementary or intermediate level course, my guess is that yours is the intended solution. However, you explicitly asked whether there are better methods. There may be, but they may lead you into Bayesian approaches and computationally intensive methods such as bootstrapping.
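For readers curious what a bootstrap might look like here, the following is one possible sketch of a naive parametric percentile bootstrap, not a recommended or definitive method. It assumes both counts are binomial with the same $p,$ plugs in $\hat p = y/M$ and $\hat N = x/\hat p,$ and resamples both counts. All function names, the number of replicates, and the seed are illustrative choices.

```python
import random

def binomial_draw(rng, n, p):
    """Draw one Binomial(n, p) variate as a sum of n Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))

def bootstrap_interval_for_n(x, y, m, reps=1000, level=0.95, seed=0):
    """Naive parametric percentile bootstrap interval for N (a sketch)."""
    rng = random.Random(seed)
    p_hat = y / m
    n_hat = round(x / p_hat)
    estimates = []
    for _ in range(reps):
        y_star = binomial_draw(rng, m, p_hat)      # resampled known errors found
        x_star = binomial_draw(rng, n_hat, p_hat)  # resampled unknown errors found
        if y_star == 0:
            continue  # estimate of N is undefined when no known error is found
        estimates.append(x_star * m / y_star)
    estimates.sort()
    alpha = 1 - level
    lo_idx = int(alpha / 2 * len(estimates))
    hi_idx = int((1 - alpha / 2) * len(estimates)) - 1
    return estimates[lo_idx], estimates[hi_idx]

lo, hi = bootstrap_interval_for_n(x=700, y=200, m=300)
print(f"bootstrap interval for N: ({lo:.0f}, {hi:.0f})")
```

Unlike the interval built from the CI for $p$ alone, this resampling also lets $x$ vary, so it need not agree with $(972, 1141).$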

In a simpler context where there is a single binomial model, efforts to estimate binomial $N$ are notoriously unstable because small differences in the number of Successes can result in large differences in the estimates of $N$. Various approaches have been taken toward better estimation of $N.$ You might start by searching the Internet for "estimating binomial N". Some of the papers are freely available, as are some authors' PDFs of preprints and related work. Perhaps see the freely available papers by DasGupta & Rubin and by Carroll & Lombard, and their references. Levels of required statistical knowledge vary greatly.
