I’ve encountered the following seemingly simple probability interview question in my workplace:
Two reviewers were tasked with finding errors in a book. The first had found 40 errors and the other had found 60. 20 of the found errors were found in common. Give an estimate on the number of errors in the book.
A few clarifications were given:
- The errors are not false positives.
- The probability of the reviewers to find any error is independent of each other. (Problematic phrasing?)
- A lower bound (i.e., at least 80 errors) is not the answer being asked for.
It was my opinion that this problem is not well defined and any answer would rely on hidden assumptions.
My coworker said that the solution is easily calculable using the following method, where $x$ denotes the total number of errors:
$$P(A) = \frac{40}{x}$$
$$P(B) = \frac{60}{x}$$
$$P(A\cap B) = \frac{20}{x}$$
$$P(A\cap B) = P(A) \cdot P(B)$$
$$\frac{20}{x} = \frac{40}{x} \cdot \frac{60}{x}$$
$$20x = 2400$$
$$x = 120$$
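For reference, the calculation above amounts to solving $\frac{n_{AB}}{x} = \frac{n_A}{x} \cdot \frac{n_B}{x}$ for $x$. A minimal Python sketch of that arithmetic follows; the function name and the choice to raise an exception on zero overlap are illustrative assumptions, not part of the original question.

```python
def estimate_total_errors(found_a: int, found_b: int, found_both: int) -> float:
    """Solve found_both / x = (found_a / x) * (found_b / x) for x."""
    if found_both == 0:
        # With no overlap the equation becomes 0 = found_a * found_b / x**2,
        # which has no finite solution (illustrative handling, an assumption).
        raise ValueError("no errors in common: the estimate is undefined")
    return found_a * found_b / found_both


print(estimate_total_errors(40, 60, 20))  # 120.0
```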
I found this answer unsatisfying, but I am struggling to coherently explain why. I believe there are various assumptions hidden in the above “solution”.
I need help identifying these assumptions or phrasing issues with the question itself that make it not well defined. It could be that I’m mistaken and the problem is well defined and I’ve complicated it.
I am also interested in alternative solutions that could be based on different assumptions but don’t negate the clarifications made.
Let $A_i \thicksim \text{Ber}(p)$ be a random variable describing whether or not person $A$ found error $i$, and let $B_i$ be the same for person $B$. The answer posted assumes that $\forall i,j:\ \mathbb{P}(A_i = 1) = \mathbb{P}(A_j = 1)$, which doesn't feel right. For example, if the errors are typos, then the typo "dwjaiodajwio" is more obvious than using "there" instead of "their". We should also consider types of error: maybe person $B$ is better at finding grammatical errors than person $A$, but person $A$ can find all of the spelling errors.
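To see how much the equal-probability assumption matters, here is a rough simulation sketch in Python. Everything in it is made up for illustration: a true count of 120 errors, the split into two classes of 60, and the detection probabilities. In every scenario the two reviewers act independently on each individual error and find about 42 and 60 errors on average; only the way the detection probability varies across errors changes.

```python
import random


def mean_estimate(error_classes, trials=1000, seed=0):
    """Average the estimate |A| * |B| / |A and B| over simulated reviews.

    error_classes is a list of (count, p_a, p_b) tuples: `count` errors that
    reviewer A finds with probability p_a and reviewer B with probability p_b,
    independently of each other. All values are hypothetical.
    """
    rng = random.Random(seed)
    estimates = []
    for _trial in range(trials):
        found_a = found_b = found_both = 0
        for count, p_a, p_b in error_classes:
            for _error in range(count):
                a_hit = rng.random() < p_a
                b_hit = rng.random() < p_b
                found_a += a_hit
                found_b += b_hit
                found_both += a_hit and b_hit
        if found_both > 0:  # the estimate is undefined without any overlap
            estimates.append(found_a * found_b / found_both)
    return sum(estimates) / len(estimates)


# Three hypothetical books, each with 120 true errors.
scenarios = {
    "same chance for every error": [(120, 0.35, 0.5)],
    "both better at the same errors": [(60, 0.6, 0.8), (60, 0.1, 0.2)],
    "complementary strengths": [(60, 0.6, 0.4), (60, 0.1, 0.6)],
}
for name, classes in scenarios.items():
    print(f"{name}: mean estimate {mean_estimate(classes):.1f} (true total 120)")
```

Under these made-up numbers, the averaged estimate should land near 120 only in the first scenario. It should fall well below 120 when both reviewers favour the same easy errors, and above 120 when their strengths are complementary, even though independence holds error by error throughout.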
Even if we choose to assume this, $\mathbb{P}(A_i = 1) = \frac{40}{x}$ is still incorrect. Let $A \thicksim \text{Bin}(x, \frac{40}{x})$ and $B \thicksim \text{Bin}(x, \frac{60}{x})$. Then, given $x$ total errors, we expect $A = 40$ and $B = 60$, but this is only the expectation; on any given trial the counts need not equal it. That is the biggest problem here: we claim that this one trial is equal to the expectation.
The answer given has assumed that the number of errors found is equal to the true expectation (i.e. $A = x \cdot \frac{40}{x} = 40 = \mathbb{E}[A]$). That is the big "hidden assumption" that the answer makes without saying so. On just one trial, it is ridiculous to assume this, and the other answer from heropup showed an example of why this becomes a problem if we find $0$ errors in common. You are certainly correct that this is not a well-defined problem, and these things should be specified for it to make sense.
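A quick way to see the gap between one trial and the expectation is to simulate the posted answer's own model. The sketch below assumes, purely for illustration, that the truth really is $x = 120$ with per-error detection probabilities $\frac{40}{120}$ and $\frac{60}{120}$, and repeats the review 2000 times with a fixed seed. The constants and variable names are hypothetical.

```python
import random

TRUE_ERRORS = 120               # hypothetical "true" x, matching the posted answer
P_A, P_B = 40 / 120, 60 / 120   # per-error detection probabilities that answer implies

rng = random.Random(1)
estimates = []
for _trial in range(2000):
    found_a = found_b = found_both = 0
    for _error in range(TRUE_ERRORS):
        a_hit = rng.random() < P_A
        b_hit = rng.random() < P_B
        found_a += a_hit
        found_b += b_hit
        found_both += a_hit and b_hit
    if found_both:  # skip the (very unlikely) reviews with no overlap at all
        estimates.append(found_a * found_b / found_both)

within_10 = sum(abs(e - TRUE_ERRORS) <= 10 for e in estimates) / len(estimates)
print(f"smallest estimate: {min(estimates):.0f}")
print(f"largest estimate:  {max(estimates):.0f}")
print(f"share of reviews with an estimate within 10 of {TRUE_ERRORS}: {within_10:.0%}")
```

Even when every assumption of the posted answer holds, the individual estimates should scatter noticeably around 120, so a single review landing on the true value would be a coincidence.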
It would be hard to get an estimate of the true probability, since we don't know the true number of errors or how the errors work. By analogy, if a disease were spreading in a country and we knew that at least $100$ people were sick out of some population, it would be hard to estimate the number of sick people while knowing literally nothing about the disease. It could be exactly $100$ if the disease is a rare genetic condition, or it could be $100{,}000$ if the disease is like the common cold. We don't know, and no estimate will feel satisfactory, since we would need major assumptions about the data.
Final edit: What if they found $40$ errors in common, and $A$ still found $40$ while $B$ still found $60$? Then it seems like we should expect there to be only $60$ errors based on the work at hand, but it makes literally no sense to just assume $B$ is perfect.
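Written out with the same equation as the posted answer (only the overlap changes from $20$ to $40$):

$$\frac{40}{x} = \frac{40}{x} \cdot \frac{60}{x} \implies 40x = 2400 \implies x = 60,$$

so $\mathbb{P}(B) = \frac{60}{60} = 1$, i.e. the method concludes that $B$ missed nothing.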