Probability of cheating knowing the occurrence of some probabilistic events

110 Views Asked by At

Assume we have a set of documents $D = \{d_0, d_1, ..., d_n\}$, and another set of independent artefacts $A = \{a_0, a_1, ..., a_{m-1}\}$ where the likelihood of an artefact $a_x$ appearing in a document is $p_x$.

If we have two documents $(d_a, d_b)$ for which we know they share a set of artefacts $\{a_x, a_y, ..., a_z\}$ what is the probability of a this being just a coincidence (or the converse that a copying happened)?

How about the more general question: The case where we have a tuple of documents $(d_a, d_b, ..., d_l)$ for which we know they share a $\{a_x, a_y, ..., a_z\}$?

2

There are 2 best solutions below

0
On

If 2 documents are independent, then they both have a common set with probability = $\left( p_{x} \cdot p_{y} \cdot \dots \cdot p_{z} \right)^2$

1
On

If I understand right what you are after, you want to design a test for catching cheaters with low probability of error. The story, as usual, is that you should determine your "unacceptable low probability event" $E$ beforehand and to decide what is the "alternative hypothesis" to the independence. In this case I assume that "copying" means literal copying, so the copied documents contain exactly the same artefacts. Then your task is to choose some sets of artefacts and cry "cheating" if you see exactly the same of them in 2 documents. If $S$ is a set, the chance that this exact set will appear in some two independent documents is at most $\frac{n(n+1)}2\prod_S p_i^2\prod_{S^c}(1-p_i)^2$.

You want $$ \sum_S\sigma(S)^2=\sum_S \prod_S p_i^2\prod_{S^c}(1-p_i)^2\le\frac{2\alpha}{n(n+1)} $$ where $\alpha$ is the allowed chance of false accusation and $\sigma(S)=\prod_S p_i\prod_{S^c}(1-p_i)$. On the other hand, you want to maximize $\sum_S \sigma(S)$ to accuse as often as you can. Thus, theoretically, you should arrange all sets according to $\sigma(S)$ in the increasing order and choose several first ones.

Your cutoff is then just $\sigma(S)\le B$ (meaning that if you see exactly the same set $S$ of artefacts in 2 documents and this condition holds, then you make an accusation; otherwise you let it pass). $B$ has to be chosen so that the displayed inequality holds (ideally it should become an approximate identity. The always safe level is $B=\frac{2\alpha}{n(n+1)}$ (just from the full probability formula) but you may often be able to go well above that.

Let's say, for instance, that you have $10$ students producing exam papers and $3$ possible common mistakes with the probabilities $1/2,1/4,1/8$. Assume that you can afford to make a false accusation with probability at most $10\%$. Then the RHS in the displayed formula becomes $\frac 1{450}$. A direct computation (just write a short program) shows that you can accuse of cheating if you have two papers with all three mistakes or with no first mistake but the last two present, but not otherwise.

Of course, this makes many implicit assumptions. One most dubious assumption is that your students are all equally good. If two students are really dismal and make all mistakes that are there to make on a regular basis while the probabilities of mistakes are computed as frequencies in an otherwise good class, you run high risk of accusing them falsely. So the actual real life problem is a bit more complicated than this simple exercise in elementary statistics.