Pearson's Test for contingency tables

This topic has been discussed in this forum, but I don't think the problem has been addressed completely.

To apply Pearson's test to a contingency table one computes $$\sum{ {\rm (Observed - Expected)}^2 \over {\rm Expected}}$$ and argues that under certain conditions this statistic has an approximate $\chi^2$ distribution.

The obvious question is: why don't we divide by Expected$^2$? The answer that dividing by Expected gives an (approximate) sum of squared standard normals is, in my view, not acceptable: we should do what must be done, and if what we get is not nice, so be it. In fact, here is what can happen. Suppose we take the original sample and split it in two, keeping the same proportions, i.e. each cell count is the same proportion of the total count as it was before the split. Then the value of the statistic is half of what it was, and it may well happen that a null hypothesis that was rejected before is no longer rejected. How is this reasonable? Shouldn't the dependence/independence determination be the same in both cases? I haven't found a discussion of this aspect anywhere in the literature. Pointers would be welcome. Thanks.
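For concreteness, here is a quick numerical check of the halving effect (the $2\times 2$ counts are hypothetical; the second table has the same cell proportions as the first at half the total):

```python
def pearson_stat(table):
    """Pearson's chi-square statistic for a contingency table given as a list of rows."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    q = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n  # expected count under independence
            q += (obs - exp) ** 2 / exp
    return q

table = [[30, 10], [20, 40]]   # hypothetical counts
half  = [[15,  5], [10, 20]]   # same proportions, half the total count
print(pearson_stat(table), pearson_stat(half))  # the second value is exactly half the first
```

Since each observed count, each expected count, and the total are all halved, every term $(O-E)^2/E$ is halved as well.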


Suppose, as an approximation, one takes the count $X_{ij}$ in cell $(i,j)$ of an $r \times c$ contingency table to be Poisson with $E_{ij}$ as mean. Then the 'standard score' $\frac{X_{ij} - \mu_{ij}}{\sigma_{ij}}$ for that cell is estimated by $Z_{ij}= \frac{X_{ij} - E_{ij}}{\sqrt{E_{ij}}}$ because the mean $\mu$ and variance $\sigma^2$ of a Poisson distribution are numerically equal.
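The mean-equals-variance property is easy to verify empirically; the sketch below (with an arbitrary mean of 7) just checks that sample mean and sample variance of Poisson draws coincide:

```python
import numpy as np

# Empirical check that a Poisson variable's mean and variance are equal,
# which is why (X - mu)/sqrt(mu) is the natural standard score.
rng = np.random.default_rng(0)
x = rng.poisson(lam=7.0, size=200_000)  # lam = 7 is an arbitrary choice
print(x.mean(), x.var())                # both close to 7
```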

It follows that $Z_{ij}^2= \frac{(X_{ij}-E_{ij})^2}{E_{ij}}.$ And that is the answer to your question.

For sufficiently large counts, $Z_{ij}^2$ is approximately the square of a standard normal random variable and thus $\mathsf{Chisq}(df=1).$ If we had independent estimates $E_{ij}$ for each cell then the sum $Q = \sum_i \sum_j Z_{ij}^2$ would be approximately distributed as $\mathsf{Chisq}(df=rc).$

However, the $E_{ij}$ are estimated from the row, column, and grand totals. This puts linear restrictions on the deviations $X_{ij} - E_{ij}$ (the deviations in each row and in each column sum to zero), so that $Q$ is approximately distributed as $\mathsf{Chisq}(df=(r-1)(c-1)).$
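A short simulation (with arbitrary marginal probabilities) illustrates the reduced degrees of freedom: since the mean of a $\mathsf{Chisq}(df)$ variable is $df$, the average of $Q$ over many independent tables should be near $(r-1)(c-1)$, not $rc$.

```python
import numpy as np

rng = np.random.default_rng(1)
r, c, n, sims = 3, 4, 500, 2000
p = np.array([0.2, 0.3, 0.5])        # hypothetical row probabilities
q = np.array([0.1, 0.2, 0.3, 0.4])   # hypothetical column probabilities
cell_p = np.outer(p, q).ravel()      # independence: P(i,j) = p_i * q_j

qs = []
for _ in range(sims):
    t = rng.multinomial(n, cell_p).reshape(r, c)
    e = np.outer(t.sum(axis=1), t.sum(axis=0)) / n  # estimated expected counts
    qs.append(((t - e) ** 2 / e).sum())

# df = (r-1)(c-1) = 6 here, not rc = 12
print(np.mean(qs))  # close to 6
```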

Originally, the distribution theory was justified by an intuitive argument not much more sophisticated than the one sketched here. Later, theorems on the convergence of conditional measures made the argument rigorous. Even so, there remains the issue of how fast the convergence to $\mathsf{Chisq}((r-1)(c-1))$ is. That has been settled mainly by simulation studies, resulting in rules of thumb such as 'all $E_{ij} > 5$' or 'most $E_{ij} > 5$ and all greater than 3'.
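One can run such a simulation oneself. The sketch below estimates the empirical size of the nominal 5%-level test for a $2\times 2$ table with $n=20$ and uniform cell probabilities, so that every $E_{ij}=5$, i.e. right at the classic rule of thumb (the sample size and probabilities are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, sims = 20, 5000
cell_p = np.full(4, 0.25)  # uniform 2x2 cell probabilities under independence
CRIT = 3.841               # 95th percentile of Chisq(df=1)

rejections = valid = 0
for _ in range(sims):
    t = rng.multinomial(n, cell_p).reshape(2, 2)
    rt, ct = t.sum(axis=1), t.sum(axis=0)
    if rt.min() == 0 or ct.min() == 0:
        continue           # a zero marginal makes the statistic undefined
    e = np.outer(rt, ct) / n
    valid += 1
    if ((t - e) ** 2 / e).sum() > CRIT:
        rejections += 1

print(rejections / valid)  # empirical size; roughly comparable to the nominal 0.05
```

Because the counts are discrete, the empirical size does not match the nominal level exactly, which is precisely what the rules of thumb are meant to keep under control.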