Difference between fractions at group level has a different sign than the difference between fractions in aggregate

I have obtained a result (perhaps incorrectly; we shall find out) that appears paradoxical.

Suppose I am interested in comparing fractions between 'groups' (not in the strict mathematical sense) $\alpha$ and $\beta$. Within these groups, there are two subgroups (which are present in both $\alpha$ and $\beta$), denoted $y$ and $z$.

Without loss of generality and for the sake of argument, suppose that we're interested in United States baseball batting averages, $AVG$, and have data on hits $H$ and at-bats $AB$.

Specifically,

$(H_{\alpha,y}, AB_{\alpha,y}) = (34836, 268206) \Rightarrow AVG_{\alpha,y} = \frac{34836}{268206} = 0.129885237 \ldots$

$(H_{\alpha,z}, AB_{\alpha,z}) = (81311, 366970) \Rightarrow AVG_{\alpha,z} = \frac{81311}{366970} = 0.221573970 \ldots$

$(H_{\beta,y}, AB_{\beta,y}) = (33463, 253042) \Rightarrow AVG_{\beta,y} = \frac{33463}{253042} = 0.132242868 \ldots$

$(H_{\beta,z}, AB_{\beta,z}) = (69498, 312624) \Rightarrow AVG_{\beta,z} = \frac{69498}{312624} = 0.222305389 \ldots$

Note the differences in $AVG$ between groups with the same subgroup,

$d_{y} = AVG_{\alpha,y} - AVG_{\beta,y} = -0.0023 \ldots$

$d_{z} = AVG_{\alpha,z} - AVG_{\beta,z} = -0.0007 \ldots$

However, the difference in $AVG$ between groups, without regard for subgroup, has a sign that is not intuitive to me:

$d_{y+z} = AVG_{\alpha, y+z} - AVG_{\beta, y+z} = \frac{(H_{\alpha,y} + H_{\alpha,z})}{(AB_{\alpha,y} + AB_{\alpha,z})} - \frac{(H_{\beta,y} + H_{\beta,z})}{(AB_{\beta,y} + AB_{\beta,z})} = +0.0008 \ldots$

My expectation is that $d_{y+z} \in [\min{(d_y, d_z)}, \max{(d_y, d_z)}]$.

Why is the difference in $AVG$ between groups not bounded by the subgroup-level differences?

There are 2 best solutions below

Best answer

This is Simpson's paradox (see https://en.wikipedia.org/wiki/Simpson%27s_paradox).

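The averages hide the size of each subset. Your calculations are fine, but your expectation runs into a well-known paradox: the aggregate average is weighted by at-bats, and the two groups split their at-bats between $y$ and $z$ differently.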
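The figures in the question can be checked with a short sketch. The hits and at-bats are copied from the question; the names `data`, `avg`, `share_z_alpha`, and `share_z_beta` are only illustrative. Exact fractions are used so that rounding cannot affect the signs.

```python
from fractions import Fraction

# (hits, at-bats) copied from the question, keyed by (group, subgroup)
data = {
    ("alpha", "y"): (34836, 268206),
    ("alpha", "z"): (81311, 366970),
    ("beta", "y"): (33463, 253042),
    ("beta", "z"): (69498, 312624),
}


def avg(group, subgroups):
    """Pooled batting average of `group` over the listed subgroups."""
    hits = sum(data[(group, s)][0] for s in subgroups)
    at_bats = sum(data[(group, s)][1] for s in subgroups)
    return Fraction(hits, at_bats)


# Subgroup-level differences (alpha minus beta)
d_y = avg("alpha", ["y"]) - avg("beta", ["y"])
d_z = avg("alpha", ["z"]) - avg("beta", ["z"])

# Aggregate difference, ignoring the subgroup
d_all = avg("alpha", ["y", "z"]) - avg("beta", ["y", "z"])

# Share of each group's at-bats that fall in z, the higher-average subgroup
share_z_alpha = Fraction(366970, 268206 + 366970)
share_z_beta = Fraction(312624, 253042 + 312624)

print("d_y   =", float(d_y))
print("d_z   =", float(d_z))
print("d_y+z =", float(d_all))
print("share of at-bats in z:", float(share_z_alpha), float(share_z_beta))
```

Both subgroup differences come out negative and the aggregate difference positive. The last line shows why: $\alpha$ has a larger share of its at-bats in the higher-average subgroup $z$ than $\beta$ does (roughly 57.8% versus 55.3%), which pulls the aggregate average of $\alpha$ upward.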

For ease of explanation, I will be using inequalities $x>y$ rather than the equivalent sign of the difference $x-y>0$.

Consider comparing tee-ball players to professional baseball players. Suppose tee-ball players have a 90% hit rate off a tee, but a 0% hit rate against a professional pitcher. The professionals might have a 100% hit rate off a tee, but a 30% hit rate against a professional pitcher. The professionals perform better in both cases.

However, most tee-ball players don't bat against professionals, and most professionals don't bat off a tee. We might have only 0.0001% of tee-ball at-bats taken against professional pitchers, and 0.00001% of professional at-bats taken off a tee.

Then the average over both cases (all at-bats) would be extremely close to 90% for tee-ball players, and very close to 30% for professionals. The tee-ball players perform better overall.
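A minimal sketch of this arithmetic, assuming the two small percentages are read as shares of at-bats (the variable and function names are only illustrative):

```python
# Hit rates from the example: off a tee, and against a professional pitcher
tee_ball = {"tee": 0.90, "pro_pitch": 0.00}
professional = {"tee": 1.00, "pro_pitch": 0.30}

# Assumption: the example's percentages are shares of each group's at-bats
tee_ball_pro_share = 0.0001 / 100       # tee-ball at-bats against a pro pitcher
professional_tee_share = 0.00001 / 100  # professional at-bats off a tee


def overall(rates, share_pro_pitch):
    """Weighted average of the two hit rates, weighted by share of at-bats."""
    share_tee = 1 - share_pro_pitch
    return share_tee * rates["tee"] + share_pro_pitch * rates["pro_pitch"]


tee_ball_overall = overall(tee_ball, tee_ball_pro_share)
professional_overall = overall(professional, 1 - professional_tee_share)

print("tee-ball overall:    ", tee_ball_overall)
print("professional overall:", professional_overall)
```

The professionals win in each setting taken separately, yet the overall rates are dominated by the setting each group actually bats in.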

The conclusion from the overall average is not so much that tee-ball players are better than professionals, but rather that tee-ball players hit stationary balls better than professionals hit 90 mph pitches.

Second answer

This apparent paradox arises because, when you pool two or more prevalences (e.g., proportions) observed in subsets, the resulting prevalence in the overall population is not the arithmetic mean of these prevalences but a weighted mean, in which each subset is weighted by its size.
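In symbols (the notation here is introduced only for illustration): for a group split into two subsets of sizes $n_1, n_2$ with prevalences $p_1, p_2$, the pooled prevalence is

$$p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = w_1 p_1 + w_2 p_2, \qquad w_i = \frac{n_i}{n_1 + n_2}, \qquad w_1 + w_2 = 1.$$

For two groups $A$ and $B$ with weights $w_i^A, w_i^B$ and subset prevalences $p_i^A, p_i^B$, the difference of the pooled prevalences can then be written as

$$p^A - p^B = w_1^B \left(p_1^A - p_1^B\right) + w_2^B \left(p_2^A - p_2^B\right) + \left(w_1^A - w_1^B\right)\left(p_1^A - p_2^A\right).$$

The first two terms are a weighted average of the subset-level differences, which does lie between them. The last term depends only on how differently the two groups are split. It vanishes when the weights agree, and otherwise it can be large enough to push the overall difference outside that range, or even to flip its sign.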

For example, let us take a group $A$ of size $8$ divided into two subsets, $A_1$ of size $3$ and $A_2$ of size $5$. Suppose that a certain property is present in one item of subset $A_1$ and one item of subset $A_2$. The prevalences in $A_1$ and $A_2$ are then $\frac{1}{3}$ and $\frac{1}{5}$, and the resulting overall prevalence in $A$ is $\frac{1+1}{3+5}=\frac{2}{8}=\frac{1}{4}=0.25$. Note that this weighted value is smaller than the arithmetic mean, which would have been $\frac{4}{15} \approx 0.267$: this occurs because the larger subset $A_2$ (with lower prevalence) weighs more than the subset $A_1$.

Now consider a group $B$ of size $1003$ divided into two subsets, $B_1$ of size $3$ and $B_2$ of size $1000$. Suppose that the above-mentioned property is present in one item of subset $B_1$ and $200$ items of subset $B_2$. The prevalences in $B_1$ and $B_2$ are then $\frac{1}{3}$ and $\frac{200}{1000}=\frac{1}{5}$, i.e. equal to those of the subsets $A_1$ and $A_2$. However, the resulting overall prevalence in $B$ is $\frac{1+200}{3+1000}=\frac{201}{1003} \approx 0.2004$. Again, this weighted value is smaller than the arithmetic mean. More importantly, despite the identical subset prevalences, the overall prevalence is considerably smaller than that obtained for group $A$. This is because the very large subset $B_2$ (with lower prevalence) weighs so heavily that the overall prevalence ends up very close to that of this predominant subset.

These two examples show why two subsets with given prevalences can result in very different overall group prevalences, depending on the relative sizes of the subsets.
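The two computations above can be reproduced with a short sketch (the helper names `pooled` and `unweighted` are only illustrative; each subset is given as a `(cases, size)` pair):

```python
from fractions import Fraction


def pooled(subsets):
    """Overall prevalence: total cases divided by total size."""
    cases = sum(c for c, _ in subsets)
    size = sum(n for _, n in subsets)
    return Fraction(cases, size)


def unweighted(subsets):
    """Plain arithmetic mean of the subset prevalences."""
    return sum(Fraction(c, n) for c, n in subsets) / len(subsets)


group_a = [(1, 3), (1, 5)]         # A_1, A_2
group_b = [(1, 3), (200, 1000)]    # B_1, B_2

print(pooled(group_a), pooled(group_b))          # 1/4 201/1003
print(unweighted(group_a), unweighted(group_b))  # 4/15 4/15
```

The subset prevalences, and hence their arithmetic means, are identical in the two groups; only the subset sizes differ, and that alone moves the pooled prevalence from $0.25$ to about $0.2004$.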

This also explains how two subsets $A_1$ and $A_2$, with prevalences that are both lower than those of two other subsets $B_1$ and $B_2$, respectively, can give a paradoxically higher overall prevalence. This can occur, for instance, when in one group the lower-prevalence subset is the larger one, whereas in the other group the higher-prevalence subset is the larger one. Varying the example above, imagine that group $A$ has size $1003$ and is divided into two subsets, $A_1$ of size $990$ and $A_2$ of size $13$, with prevalences of a given property of $\frac{300}{990} \approx 0.303$ and $\frac{1}{13} \approx 0.077$, respectively. Let us compare this group with the same group $B$ described above, which has size $1003$ and includes two subsets $B_1$ and $B_2$ with prevalences of $\frac{1}{3}$ and $\frac{200}{1000}=\frac{1}{5}$, respectively. Although the prevalence in $A_1$ is lower than that in $B_1$, and that in $A_2$ is lower than that in $B_2$, the overall prevalence in $A$ is $\frac{301}{1003} \approx 0.30$, i.e. much larger than the $0.20$ prevalence of group $B$. This is because the subset $A_1$, with considerably larger size and higher prevalence than $A_2$, weighs much more and leads to a relatively high overall prevalence. The opposite occurs in group $B$, where the subset $B_2$, with considerably larger size but lower prevalence than $B_1$, weighs much more and leads to a relatively low overall prevalence.
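As a check on this last example, a minimal sketch (names are only illustrative; each subset is a `(cases, size)` pair) that compares the groups subset by subset and then overall:

```python
from fractions import Fraction

# (cases, size) for each subset, taken from the example
a1, a2 = (300, 990), (1, 13)
b1, b2 = (1, 3), (200, 1000)


def prevalence(*subsets):
    """Prevalence of one subset, or the pooled prevalence of several."""
    cases = sum(c for c, _ in subsets)
    size = sum(n for _, n in subsets)
    return Fraction(cases, size)


# Subset by subset, A is lower than B ...
a1_lower = prevalence(a1) < prevalence(b1)
a2_lower = prevalence(a2) < prevalence(b2)

# ... but pooled over both subsets, A is higher than B
overall_a = prevalence(a1, a2)
overall_b = prevalence(b1, b2)

print(a1_lower, a2_lower)    # True True
print(overall_a, overall_b)  # 301/1003 201/1003
print(overall_a > overall_b)  # True
```

Both groups have the same total size, so the reversal comes entirely from how that total is split between the subsets.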