Minimal example of Simpson's paradox


Let's say that a finite probability space $(\Omega,\mathscr P(\Omega),P)$ has Simpson's property if you can find events $A,B,C\in\mathscr P(\Omega)$ such that

  • $P(C) \in (0,1)$.
  • $A$ and $B$ are positively correlated: $P(A\cap B) > P(A)P(B)$.
  • $A$ and $B$ are negatively correlated conditionally on both $C$ and $\overline C$: $$P(A\cap B\mid C) < P(A\mid C) P(B\mid C) \text{ and } P(A\cap B\mid\overline C) < P(A\mid\overline C) P(B\mid\overline C).$$
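To make the definition concrete, here is a minimal sketch of a checker for the three conditions on a finite space. The names `prob` and `has_simpson_triple`, and the representation of the space as a dict of point masses, are my own assumptions for illustration. The conditional inequalities are multiplied through by $P(C)^2$ (resp. $P(\overline C)^2$), which is harmless because the first condition makes both positive.

```python
from fractions import Fraction
from itertools import product

def prob(event, weights):
    """Probability of an event (a set of outcomes) under the point masses `weights`."""
    return sum((weights[w] for w in event), Fraction(0))

def has_simpson_triple(weights, A, B, C):
    """Return True if the events (A, B, C) satisfy the three conditions above."""
    omega = set(weights)
    # Condition 1: 0 < P(C) < 1.
    if not 0 < prob(C, weights) < 1:
        return False
    # Condition 2: A and B are positively correlated.
    if not prob(A & B, weights) > prob(A, weights) * prob(B, weights):
        return False
    # Condition 3: negative correlation given D, for D = C and D = complement of C.
    # P(A n B | D) < P(A | D) P(B | D) is equivalent to
    # P(A n B n D) P(D) < P(A n D) P(B n D) because P(D) > 0.
    for D in (C, omega - C):
        lhs = prob(A & B & D, weights) * prob(D, weights)
        rhs = prob(A & D, weights) * prob(B & D, weights)
        if not lhs < rhs:
            return False
    return True

# Usage: three independent fair coin flips. A and B are independent here,
# so condition 2 already fails.
coins = {o: Fraction(1, 8) for o in product((0, 1), repeat=3)}
A = {o for o in coins if o[0] == 1}
B = {o for o in coins if o[1] == 1}
C = {o for o in coins if o[2] == 1}
independent_result = has_simpson_triple(coins, A, B, C)
print(independent_result)  # False
```

Exact `Fraction` arithmetic is used so that strict inequalities are not blurred by floating-point rounding.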

One way to state Simpson's paradox is that there are probability spaces with Simpson's property. The cat-vs-human example given in this nice video, for instance, boils down to this:

(The four small points each have probability $1/10$, and the two big ones have probability $3/10$ each.)

My (very naïve and probably not very interesting) question is whether there is a smaller example (one with fewer points) and, if so, what a provably minimal example looks like.


Best answer

$\def\c{^\mathrm{c}}\def\peq{\mathrel{\phantom{=}}{}}\def\Ω{{\mit Ω}}$First, define$$ p_0 = P(A\c \cap B\c \cap C\c),\ p_1 = P(A\c \cap B\c \cap C),\ p_2 = P(A\c \cap B \cap C\c),\ p_3 = P(A\c \cap B \cap C),\\ p_4 = P(A \cap B\c \cap C\c),\ p_5 = P(A \cap B\c \cap C),\ p_6 = P(A \cap B \cap C\c),\ p_7 = P(A \cap B \cap C), $$ so that $\sum\limits_{k = 0}^7 p_k = 1$. Since $P(C) = p_1 + p_3 + p_5 + p_7$, condition 1 is equivalent to$$ 0 < p_1 + p_3 + p_5 + p_7 < 1. \tag{1} $$ Analogously, condition 2 is equivalent to$$ p_6 + p_7 > (p_4 + p_5 + p_6 + p_7) (p_2 + p_3 + p_6 + p_7), \tag{2} $$ and condition 3, after multiplying through by $P(C)^2$ and $P(C\c)^2$ respectively, is equivalent to$$ \qquad \begin{cases} p_7 (p_1 + p_3 + p_5 + p_7) < (p_5 + p_7) (p_3 + p_7) & \qquad (3)\\ p_6 (p_0 + p_2 + p_4 + p_6) < (p_4 + p_6) (p_2 + p_6) & \qquad (4) \end{cases} $$

Note that $p_2 + p_3 + p_4 + p_5 = 1 - (p_0 + p_1 + p_6 + p_7)$, thus\begin{align*} &\peq (p_4 + p_5 + p_6 + p_7) (p_2 + p_3 + p_6 + p_7) - (p_6 + p_7)\\ &= (p_6 + p_7)^2 + (p_2 + p_3 + p_4 + p_5) (p_6+ p_7) + (p_2 + p_3) (p_4 + p_5) - (p_6 + p_7)\\ &= (p_6 + p_7)^2 - (p_0 + p_1 + p_6 + p_7) (p_6 + p_7) + (p_2 + p_3) (p_4 + p_5)\\ &= (p_2 + p_3) (p_4 + p_5) - (p_0 + p_1) (p_6 + p_7), \end{align*} which implies that (2) is equivalent to$$ (p_2 + p_3) (p_4 + p_5) < (p_0 + p_1) (p_6 + p_7). \tag{$2'$} $$ Because $(p_5 + p_7) (p_3 + p_7) - p_7 (p_1 + p_3 + p_5 + p_7) = p_3 p_5 - p_1 p_7$, (3) is equivalent to$$ p_1 p_7 < p_3 p_5. \tag{$3'$} $$ Analogously, (4) is equivalent to$$ p_0 p_6 < p_2 p_4. \tag{$4'$} $$
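As a sanity check on the algebra, the three reductions can be tested with exact arithmetic on random probability vectors. This is only a sketch: the helper names `random_distribution` and `identities_hold` are made up here, and a random test is evidence rather than a proof. The $C\c$ case is built directly from $P(A \cap C\c) = p_4 + p_6$ and $P(B \cap C\c) = p_2 + p_6$.

```python
import random
from fractions import Fraction

def random_distribution(rng, n=8, max_weight=20):
    """Random rational probability vector of length n (zero entries allowed)."""
    raw = [rng.randint(0, max_weight) for _ in range(n)]
    if sum(raw) == 0:
        raw[0] = 1  # avoid dividing by zero in the degenerate case
    total = sum(raw)
    return [Fraction(x, total) for x in raw]

def identities_hold(p):
    """Check the three differences used to pass from (2), (3), (4) to (2'), (3'), (4')."""
    p0, p1, p2, p3, p4, p5, p6, p7 = p
    # Uses p0 + ... + p7 = 1.
    ok2 = ((p4 + p5 + p6 + p7) * (p2 + p3 + p6 + p7) - (p6 + p7)
           == (p2 + p3) * (p4 + p5) - (p0 + p1) * (p6 + p7))
    # Pure polynomial identities, valid without the normalisation.
    ok3 = ((p5 + p7) * (p3 + p7) - p7 * (p1 + p3 + p5 + p7)
           == p3 * p5 - p1 * p7)
    ok4 = ((p4 + p6) * (p2 + p6) - p6 * (p0 + p2 + p4 + p6)
           == p2 * p4 - p0 * p6)
    return ok2 and ok3 and ok4

rng = random.Random(0)  # fixed seed so the run is reproducible
all_ok = all(identities_hold(random_distribution(rng)) for _ in range(1000))
print(all_ok)  # True
```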

Now, since all $p_k \geqslant 0$, (3') implies that $p_3, p_5 > 0$, and (4') implies that $p_2, p_4 > 0$. Since (2') implies that $p_0 + p_1 > 0$ and $p_6 + p_7 > 0$, at least one of $p_0$ and $p_1$ is positive, and at least one of $p_6$ and $p_7$ is positive. Therefore, at least six of $p_0, \dots, p_7$ are positive. The eight events associated with the $p_k$ are pairwise disjoint, and each one with positive $p_k$ contains at least one element, thus $|\Ω| \geqslant 6$, i.e. the sample space contains at least six elements. Since the example in the question has exactly six points, it is minimal.
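For a concrete check that six points suffice, here is a small sketch that tests conditions (1), (2'), (3'), (4') on one six-point configuration. The assignment $p_0 = p_7 = 3/10$, $p_2 = p_3 = p_4 = p_5 = 1/10$, $p_1 = p_6 = 0$ is my own choice, picked to match the weights quoted in the question; I am not claiming it reproduces the exact picture from the video. The function name `reduced_conditions_hold` is likewise just for illustration.

```python
from fractions import Fraction

def reduced_conditions_hold(p):
    """Check (1), (2'), (3'), (4') for weights p = [p_0, ..., p_7]."""
    p0, p1, p2, p3, p4, p5, p6, p7 = p
    return (
        sum(p) == 1
        and all(x >= 0 for x in p)
        and 0 < p1 + p3 + p5 + p7 < 1                        # (1)
        and (p2 + p3) * (p4 + p5) < (p0 + p1) * (p6 + p7)    # (2')
        and p1 * p7 < p3 * p5                                # (3')
        and p0 * p6 < p2 * p4                                # (4')
    )

# Assumed weights: p_0 = p_7 = 3/10, p_2 = p_3 = p_4 = p_5 = 1/10, p_1 = p_6 = 0.
six_point = [Fraction(n, 10) for n in (3, 0, 1, 1, 1, 1, 0, 3)]
support_size = sum(1 for x in six_point if x > 0)

print(reduced_conditions_hold(six_point), support_size)  # True 6
```

Here (2') reads $\frac{2}{10}\cdot\frac{2}{10} < \frac{3}{10}\cdot\frac{3}{10}$, while (3') and (4') both read $0 < \frac{1}{100}$.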