How many typos are in the document?


We assume that $1\%$ of characters in a document are typos.

We want to find the probability of having at most 2 typos in a 100-character document.

We want to find it in two ways:

a) Precisely

b) Approximately


For a: Let $X$ denote the number of typos, a discrete random variable. We want $$P(X \le 2) = P(X=0)+P(X=1)+P(X=2)$$ $$= \binom{100}{0}(0.01)^0(1-0.01)^{100-0}+\binom{100}{1}(0.01)^1(1-0.01)^{100-1}+\binom{100}{2}(0.01)^2(1-0.01)^{100-2}$$

Is this right?


For b:

Can someone explain in simple words how we handle this approximately?

Do we always choose a distribution for this?

And if so, how do we choose the right distribution?


Please KIS (Keep It Simple) as much as you can. Thank you.

This is linked to: Find the probability for defective diskettes.

3

There are 3 best solutions below

4
On BEST ANSWER

The assumption is poorly stated. Taken literally, it would mean that a $100$-character document always has exactly $1$ typo. What it should say is that each character has probability $0.01$ of being a typo, and the characters are independent.

As in the last example, the true distribution is binomial, in this case with $n = 100$ and $p = 0.01$.
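To check the exact value numerically, here is a minimal Python sketch that sums the three binomial terms directly. The helper name `binomial_cdf` is my own and only illustrative; it uses nothing beyond the standard library.

```python
from math import comb

def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# n = 100 characters, each a typo with probability p = 0.01
exact = binomial_cdf(2, 100, 0.01)
print(f"{exact:.6f}")  # 0.920627
```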

There are basically two approximations to the binomial distribution that are considered in elementary probability: the Poisson and the normal. The Poisson distribution is good when $n$ is large and $p$ is small so that $np$ is not too big. The normal is good when $n$ is large and $p$ is not very close to $0$ or $1$. In this case $p = 0.01$ is small, so the approximation to use would be Poisson.

EDIT: For interest, the graph below shows, as a function of $p$, the maximum (for integers $x$ from $0$ to $100$) of the error in $P(X \le x)$ when using the Poisson distribution (in blue) or the normal distribution with continuity correction (in red), where $X$ is binomial with $n=100$. From this point of view, Poisson is better for $p < 0.115$ approximately.

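[Figure: maximum error in $P(X \le x)$ as a function of $p$ for $n = 100$; Poisson approximation in blue, normal approximation with continuity correction in red]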
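A comparison of this kind can be sketched in a few lines of Python. This is not the code used for the graph, only an assumed reconstruction of the same idea: for a given $p$, it takes the largest absolute error in $P(X \le x)$ over $x = 0, \ldots, n$ for each approximation. The function names are hypothetical, and $p$ must lie strictly between $0$ and $1$.

```python
from math import comb, erf, exp, sqrt

def normal_cdf(x, mean, var):
    """CDF of a normal distribution with the given mean and variance."""
    return 0.5 * (1 + erf((x - mean) / sqrt(2 * var)))

def max_cdf_errors(n, p):
    """Return (poisson_error, normal_error): the largest absolute error
    in P(X <= x) over x = 0..n, where X ~ Binomial(n, p)."""
    lam = n * p                # Poisson mean, also the normal mean
    var = n * p * (1 - p)      # binomial variance, used for the normal
    binom_cdf = pois_cdf = 0.0
    pois_term = exp(-lam)      # Poisson pmf at x = 0
    pois_err = norm_err = 0.0
    for x in range(n + 1):
        binom_cdf += comb(n, x) * p**x * (1 - p)**(n - x)
        pois_cdf += pois_term
        pois_term *= lam / (x + 1)  # move to the Poisson pmf at x + 1
        pois_err = max(pois_err, abs(pois_cdf - binom_cdf))
        # continuity correction: evaluate the normal CDF at x + 0.5
        norm_err = max(norm_err, abs(normal_cdf(x + 0.5, lam, var) - binom_cdf))
    return pois_err, norm_err

for p in (0.01, 0.05, 0.1, 0.2, 0.3):
    pois_err, norm_err = max_cdf_errors(100, p)
    print(f"p={p}: Poisson {pois_err:.4f}, normal {norm_err:.4f}")
```

Scanning a finer grid of $p$ values with the same function is one way to look for the crossover point between the two approximations.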

5
On

As Robert points out, the Poisson approximation is better in this case. The actual answer is about 0.920627, the Poisson approximation gives 0.919699, and the normal approximation with continuity correction (mean $1$, variance $0.99$) gives about 0.934166. I show the normal approximation below as an alternative method that is very helpful in many situations.

Let $X$ denote the random variable that is $1$ if a given character is a typo and $0$ if it is not. Then $X$ is a Bernoulli random variable with mean $0.01$ and variance $0.01 \cdot 0.99 = 0.0099$. With 100 characters, we have 100 independent random variables $X_1, X_2, \ldots, X_{100}$ added together, each with the same distribution as $X$. The Central Limit Theorem therefore tells us that this sum is approximately normal.

We start by finding the mean and variance of the sum $$Z = X_1 + X_2 + \cdots + X_{100}.$$ First, we have $$E[Z] = E[X_1] + E[X_2] + \cdots + E[X_{100}] = 100 E[X] = 100 \cdot 0.01 = 1$$ and, because the $X_i$ are independent, the variances also add: $$Var[Z] = Var[X_1] + Var[X_2] + \cdots + Var[X_{100}] = 100 \cdot Var[X] = 100 \cdot 0.0099 = 0.99.$$

So $Z$ is approximately normal with mean $1$ and variance $0.99$. We approximate $P(Z \leq 2)$ by $P(N \leq 2.5)$, where $N$ is a normal random variable with mean $1$ and variance $0.99$. Using $N \leq 2.5$ rather than $N \leq 2$ is called a continuity correction. The point is that $Z$ takes only the integer values $0, 1, 2, \ldots$, while $N$ is continuous. Cutting at $2.5$, halfway between $2$ and $3$, assigns the normal probability between $2$ and $2.5$ to the outcome $Z = 2$, which makes the approximation more accurate.
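As a rough numerical check of this calculation, the following Python sketch evaluates $P(N \leq 2.5)$ using the error function from the standard library. The helper `normal_cdf` is a name I am introducing for illustration.

```python
from math import erf, sqrt

n, p = 100, 0.01
mean = n * p                # E[Z] = 1
variance = n * p * (1 - p)  # Var[Z] = 0.99

def normal_cdf(x, mean, variance):
    """P(N <= x) for a normal N with the given mean and variance."""
    return 0.5 * (1 + erf((x - mean) / sqrt(2 * variance)))

# Continuity correction: P(Z <= 2) is approximated by P(N <= 2.5)
approx = normal_cdf(2.5, mean, variance)
print(f"{approx:.6f}")
```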

Hopefully you can take it from there.

0
On

Another approach could be the Poisson distribution, since the probability $p$ is very small. The formula can be found here: http://infinity.cos.edu/faculty/woodbury/stats/tutorial/Pois_Form.htm The expectation lambda is $np$, here $1$, and $x$ goes from $0$ to $2$. Apply the formula three times, once for each value of $x$, and add the results.
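Applying the formula three times can be sketched in Python as follows, assuming the standard Poisson probability $P(X = x) = e^{-\lambda} \lambda^x / x!$. The function name `poisson_pmf` is only illustrative.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**x / factorial(x)

lam = 100 * 0.01  # lambda = n * p = 1
approx = sum(poisson_pmf(x, lam) for x in range(3))  # x = 0, 1, 2
print(f"{approx:.6f}")  # 0.919699
```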