Proof of Simpson's Paradox


I am studying the Implicit Function Theorem and its application to Simpson's Paradox. I was given the following problem and attempted it myself, but I am not sure my answer is correct. I would really appreciate it if someone could help me check it!

Problem:

A company tests a new medicine in two cities, $C$ and $C'$. In each city, the tests are conducted in two labs, $U$ and $U'$. In each lab, there is a test group ($T$) receiving the new medicine and a control group ($T'$) receiving the old medicine. Some people became healthy ($H$); the others did not ($H'$). The new medicine is judged to be better if a higher percentage of the people who took the new medicine become healthy than of those who took the old one. There exist samples in which the new medicine is better than the old at each of the four labs and in the aggregate in each city, but worse when aggregated over the whole test population. In other samples, the conclusions oscillate with the level: the new medicine is worse than the old at each of the four facilities, is better in each city, but is worse when aggregated over the whole population, and so forth. Present an analytical proof (not using counterexamples) that each of the above scenarios is possible, using the Implicit Function Theorem.

My attempt:

Define the following mutually exclusive groups: \begin{equation} S_1 = TCU, \quad S_2 = TCU', \quad S_3 = TC'U, \quad S_4 = TC'U', \\ S_5 = T'CU, \quad S_6 = T'CU', \quad S_7 = T'C'U, \quad S_8 = T'C'U'. \end{equation} Let $x_i = Pr\{H|S_i\}$ and $d_i = Pr\{S_i\}$ for $i = 1, \dots, 8$. Let \begin{equation} y_1 = Pr\{H|TC\}, \quad y_2 = Pr\{H|TC'\}, \quad y_3 = Pr\{H|T'C\}, \quad y_4 = Pr\{H|T'C'\} \end{equation} be the corresponding probabilities aggregated over the two labs in each city, and let \begin{equation} z_1 = Pr\{H|T\} \quad \text{and} \quad z_2 = Pr\{H|T'\} \end{equation} be the overall aggregate variables.
We first show that \begin{equation} y_j = \frac{x_{2j - 1}d_{2j - 1} + x_{2j}d_{2j}}{d_{2j-1} + d_{2j}}, \quad j = 1, 2, 3, 4. \end{equation} Consider the case $j = 1$; we want to show \begin{equation} Pr\{H|TC\} = \frac{Pr\{H|TCU\}Pr\{TCU\} + Pr\{H|TCU'\}Pr\{TCU'\}}{Pr\{TCU\} + Pr\{TCU'\}}. \end{equation} By the definition of conditional probability and finite additivity, we have \begin{equation} Pr\{H|TC\} = \frac{Pr\{H \cap TC\}}{Pr\{TC\}} = \frac{Pr\{H \cap TCU\} + Pr\{H \cap TCU'\}}{Pr\{TC\}} = \frac{Pr\{H|TCU\}Pr\{TCU\} + Pr\{H|TCU'\}Pr\{TCU'\}}{Pr\{TC\}}. \end{equation} But $Pr\{TC\} = Pr\{TCU\} + Pr\{TCU'\} - Pr\{TCU \cap TCU'\} = Pr\{TCU\} + Pr\{TCU'\}$ because $TCU \cap TCU' = \emptyset$. This proves that $y_1 = \frac{x_{1}d_{1} + x_{2}d_{2}}{d_{1} + d_{2}}$. The same argument gives $y_j = \frac{x_{2j - 1}d_{2j - 1} + x_{2j}d_{2j}}{d_{2j-1} + d_{2j}}$ for $j = 2, 3, 4$.
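
The weighted-average formula for $y_1$ can be sanity-checked on concrete head counts. The sketch below uses made-up numbers (the counts and the population size `N` are hypothetical and not part of the problem) and exact rational arithmetic, comparing the direct computation of $Pr\{H|TC\}$ with the formula $\frac{x_1d_1 + x_2d_2}{d_1 + d_2}$.

```python
from fractions import Fraction

# Hypothetical head counts for the treatment arm T in city C.
# (enrolled, healthy) per lab; these numbers are made up for illustration.
n_TCU, h_TCU = 40, 30      # S_1 = TCU
n_TCUp, h_TCUp = 60, 15    # S_2 = TCU'
N = 800                    # hypothetical size of the whole test population

# d_i = Pr{S_i} and x_i = Pr{H | S_i}
d1, d2 = Fraction(n_TCU, N), Fraction(n_TCUp, N)
x1, x2 = Fraction(h_TCU, n_TCU), Fraction(h_TCUp, n_TCUp)

# Direct computation of Pr{H | TC}: healthy people in TC over all people in TC.
y1_direct = Fraction(h_TCU + h_TCUp, n_TCU + n_TCUp)

# Weighted-average formula for y_1.
y1_formula = (x1 * d1 + x2 * d2) / (d1 + d2)

print(y1_direct, y1_formula)
```

Both values agree because the population size cancels between numerator and denominator, which is exactly why $y_1$ depends on $d_1, d_2$ only through their ratio.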
By an analogous argument, \begin{equation} z_1 = \frac{\sum_{j = 1}^{4}x_jd_j}{\sum_{j = 1}^{4}d_j} \quad \text{and} \quad z_2 = \frac{\sum_{j = 5}^{8}x_jd_j}{\sum_{j = 5}^{8}d_j}. \end{equation} Now consider the map $F:[0, 1]^8 \times \Sigma(8) \to \mathbb{R}^7$ defined by \begin{equation} F(x_1, \dots, x_8, d_1, \dots, d_8) = (x_1 - x_5, x_2 - x_6, x_3 - x_7, x_4 - x_8, y_1 - y_3, y_2 - y_4, z_1 - z_2), \end{equation} where $\Sigma(8) = \{(d_1, \dots, d_8) \mid d_i \geq 0 \text{ for } i = 1, \dots, 8, \ \sum_{i = 1}^{8}d_i = 1\}$. Writing $d_8 = 1 - \sum_{i = 1}^{7}d_i$, we can regard $F$ as a function of the 15 variables $(x_1, \dots, x_8, d_1, \dots, d_7)$: \begin{equation} F(x_1, \dots, x_8, d_1, \dots, d_7) = \left(x_1 - x_5, x_2 - x_6, x_3 - x_7, x_4 - x_8, \frac{x_1d_1 + x_2d_2}{d_1 + d_2} - \frac{x_5d_5 + x_6d_6}{d_5 + d_6}, \frac{x_3d_3 + x_4d_4}{d_3 + d_4} - \frac{x_7d_7 + x_8(1 - \sum_{i = 1}^{7}d_i)}{d_7 + (1 - \sum_{i = 1}^{7}d_i)}, \frac{\sum_{i = 1}^{4}x_id_i}{\sum_{i = 1}^{4}d_i} - \frac{\sum_{i = 5}^{7}x_id_i + x_8(1 - \sum_{i = 1}^{7}d_i)}{1 - \sum_{i = 1}^{4}d_i}\right). 
\end{equation}

To write the Jacobian compactly, keep $d_8$ as shorthand for $1 - \sum_{i = 1}^{7}d_i$ and set $\sigma = \sum_{i = 1}^{4}d_i$, $\delta = 1 - \sum_{i = 1}^{6}d_i = d_7 + d_8$, \begin{equation} \rho = \frac{(x_7 - x_8)d_7}{\delta^2}, \quad \tau_k = \frac{x_k\sigma - \sum_{i = 1}^{4}x_id_i}{\sigma^2} - \frac{\sum_{i = 5}^{7}(x_i - x_8)d_i}{(1 - \sigma)^2} \quad (k = 1, \dots, 4). \end{equation} The Jacobian is the $7 \times 15$ matrix $DF = (D_{\mathbf{x}}F \mid D_{\mathbf{d}}F)$, where the columns of $D_{\mathbf{x}}F$ correspond to $x_1, \dots, x_8$ and the columns of $D_{\mathbf{d}}F$ to $d_1, \dots, d_7$: \begin{equation} D_{\mathbf{x}}F = \begin{pmatrix}
1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & -1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & -1 \\
\frac{d_1}{d_1 + d_2} & \frac{d_2}{d_1 + d_2} & 0 & 0 & -\frac{d_5}{d_5 + d_6} & -\frac{d_6}{d_5 + d_6} & 0 & 0 \\
0 & 0 & \frac{d_3}{d_3 + d_4} & \frac{d_4}{d_3 + d_4} & 0 & 0 & -\frac{d_7}{\delta} & -\frac{d_8}{\delta} \\
\frac{d_1}{\sigma} & \frac{d_2}{\sigma} & \frac{d_3}{\sigma} & \frac{d_4}{\sigma} & -\frac{d_5}{1 - \sigma} & -\frac{d_6}{1 - \sigma} & -\frac{d_7}{1 - \sigma} & -\frac{d_8}{1 - \sigma}
\end{pmatrix}, \end{equation} \begin{equation} D_{\mathbf{d}}F = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\frac{(x_1 - x_2)d_2}{(d_1 + d_2)^2} & \frac{(x_2 - x_1)d_1}{(d_1 + d_2)^2} & 0 & 0 & -\frac{(x_5 - x_6)d_6}{(d_5 + d_6)^2} & \frac{(x_5 - x_6)d_5}{(d_5 + d_6)^2} & 0 \\
-\rho & -\rho & \frac{(x_3 - x_4)d_4}{(d_3 + d_4)^2} - \rho & \frac{(x_4 - x_3)d_3}{(d_3 + d_4)^2} - \rho & -\rho & -\rho & -\frac{x_7 - x_8}{\delta} \\
\tau_1 & \tau_2 & \tau_3 & \tau_4 & -\frac{x_5 - x_8}{1 - \sigma} & -\frac{x_6 - x_8}{1 - \sigma} & -\frac{x_7 - x_8}{1 - \sigma}
\end{pmatrix}. \end{equation} Note that the entries of row 6 in the $d_1$ and $d_2$ columns are $-\rho$, not $0$, because $d_8 = 1 - \sum_{i = 1}^{7}d_i$ depends on every $d_i$, and that the entry of row 7 in the $x_8$ column is negative.

Let $(\mathbf{x^*}, \mathbf{d^*})$ be a point with \begin{equation} x_1 = x_5, \quad x_2 = x_6, \quad x_3 = x_7, \quad x_4 = x_8, \quad d_1 = \dots = d_8 = \frac{1}{8}, \end{equation} where each $x_i \in (0, 1)$, $x_1 \neq x_2$, $x_3 \neq x_4$, and $x_1 + x_2 \neq x_3 + x_4$. Then $F(\mathbf{x^*}, \mathbf{d^*}) = \mathbf{0}$. At this point rows 5, 6, 7 of $D_{\mathbf{x}}F$ are linear combinations of rows 1 to 4, so $DF$ has rank 7 exactly when rows 5, 6, 7 of $D_{\mathbf{d}}F$ are linearly independent. Row 5 equals $2(x_1 - x_2)(1, -1, 0, 0, -1, 1, 0)$ and row 6 equals $-2(x_3 - x_4)(1, 1, 0, 2, 1, 1, 2)$; both are nonzero and they are not proportional. Both have a zero in the $d_3$ column, whereas row 7 has $\tau_3 = x_3 + x_4 - x_1 - x_2 \neq 0$ there, so the three rows are independent and $DF(\mathbf{x^*}, \mathbf{d^*})$ has maximal rank 7. (The condition $x_1 + x_2 \neq x_3 + x_4$ is needed: without it, row 7 is a linear combination of rows 5 and 6 and the rank drops to 6.)

By the Implicit Function Theorem, $F$ maps every neighborhood of $(\mathbf{x^*}, \mathbf{d^*})$ onto a neighborhood of $\mathbf{0}$; since $(\mathbf{x^*}, \mathbf{d^*})$ is an interior point, the preimages can be taken inside the domain. In other words, if we choose any sign pattern $(\varepsilon_1, \dots, \varepsilon_7)$, where each $\varepsilon_i = \pm 1$, and a point $\mathbf{w} = (w_1, \dots, w_7) \in \mathbb{R}^7$ sufficiently near $\mathbf{0}$ that realizes this sign pattern, then there exists a point $(\mathbf{x'}, \mathbf{d'})$ in $[0, 1]^8 \times \Sigma(8)$ such that $F(\mathbf{x'}, \mathbf{d'}) = \mathbf{w}$. The point $(\mathbf{x'}, \mathbf{d'})$ corresponds to a partitioning of the test population into $S_1, \dots, S_8$ so that the 7-tuple \begin{equation} \big(Pr\{H|TCU\} - Pr\{H|T'CU\}, \ Pr\{H|TCU'\} - Pr\{H|T'CU'\}, \\ Pr\{H|TC'U\} - Pr\{H|T'C'U\}, \ Pr\{H|TC'U'\} - Pr\{H|T'C'U'\}, \\ Pr\{H|TC\} - Pr\{H|T'C\}, \ Pr\{H|TC'\} - Pr\{H|T'C'\}, \\ Pr\{H|T\} - Pr\{H|T'\}\big) \end{equation} has the preassigned sign pattern $(\varepsilon_1, \dots, \varepsilon_7)$.
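
The rank claim can also be checked numerically. The sketch below is only a sanity check under stated assumptions, not part of the proof: it approximates $DF$ by central differences and estimates the rank by Gaussian elimination with a tolerance. The function names (`F`, `jacobian`, `rank`), the step size, the tolerance, and the two test points are all my own illustrative choices. The second test point satisfies $x_1 \neq x_2 \neq x_3 \neq x_4$ but has $x_1 + x_2 = x_3 + x_4$, which is the case where the rank should degenerate.

```python
def F(v):
    """v = (x1..x8, d1..d7); d8 is eliminated via d8 = 1 - (d1 + ... + d7)."""
    x = v[:8]
    d = list(v[8:]) + [1.0 - sum(v[8:])]

    def avg(idx):
        # Pr{H | union of the groups S_i, i in idx} as a weighted average.
        return sum(x[i] * d[i] for i in idx) / sum(d[i] for i in idx)

    lab = [x[i] - x[i + 4] for i in range(4)]                 # x_i - x_{i+4}
    city = [avg([0, 1]) - avg([4, 5]), avg([2, 3]) - avg([6, 7])]  # y1-y3, y2-y4
    total = [avg([0, 1, 2, 3]) - avg([4, 5, 6, 7])]           # z1 - z2
    return lab + city + total


def jacobian(f, v, h=1e-6):
    """Central-difference approximation of the Jacobian of f at v."""
    cols = []
    for k in range(len(v)):
        up, dn = list(v), list(v)
        up[k] += h
        dn[k] -= h
        cols.append([(a - b) / (2 * h) for a, b in zip(f(up), f(dn))])
    # cols[k][r] approximates dF_r/dv_k; transpose into rows.
    return [[cols[k][r] for k in range(len(v))] for r in range(len(cols[0]))]


def rank(m, tol=1e-4):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    r = 0
    for c in range(len(m[0])):
        if r == len(m):
            break
        p = max(range(r, len(m)), key=lambda i: abs(m[i][c]))
        if abs(m[p][c]) < tol:
            continue  # no usable pivot in this column
        m[r], m[p] = m[p], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


# Hypothetical test points: x_i = x_{i+4}, all d_i = 1/8.
generic = [0.2, 0.4, 0.5, 0.8] * 2 + [0.125] * 7      # x1 + x2 != x3 + x4
degenerate = [0.2, 0.6, 0.3, 0.5] * 2 + [0.125] * 7   # x1 + x2 == x3 + x4

print(rank(jacobian(F, generic)))
print(rank(jacobian(F, degenerate)))
```

A finite-difference rank estimate depends on the tolerance, so this only corroborates the analytic argument; it does not replace it.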

I am not completely sure about my answer, especially the step where I calculated $y_i$ and $z_i$ as well as the step where I picked an $(\mathbf{x^*}, \mathbf{d^*})$. Could someone please help me check? Thanks a lot in advance!