What is the mathematical derivation for the chi-square homogeneity test formula?


If we have $J$ mutually independent multinomial random vectors $N_j = (N_{1j},\ldots,N_{Ij}) \sim \text{Mult}(n_j;\; p_{1j},\ldots,p_{I-1,j})$, $j = 1,\ldots,J$, and the test's hypotheses are

  • $H_0: p_{i1} = \cdots = p_{iJ}$ for all $i \in \{1,\ldots,I\}$
  • $H_1: p_{ij} \neq p_{ij'}$ for some $i \in \{1,\ldots,I\}$ and some $j \neq j'$,

the homogeneity test statistic satisfies $$ \sum_{j=1}^{J}\sum_{i=1}^{I} \frac{(N_{ij}-n_j \hat{p}_{i})^2}{n_j \hat{p}_{i}} \stackrel{d}\longrightarrow \chi_{(J-1)(I-1)}^2 \quad\text{ as }\quad n \to \infty, $$ where $\stackrel{d}\longrightarrow$ denotes convergence in distribution, $n = \sum_{j=1}^J n_j$, and $\hat{p}_{i} = \frac{1}{n} \sum_{j=1}^{J} N_{ij}$, $i = 1,\ldots,I$, is the pooled estimator of the common value $p_i$ (under $H_0$) of the $p_{ij}$.

Where does this formula come from?
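For concreteness, the statistic can be computed directly from an $I \times J$ table of counts: pool the rows to get $\hat p_i$, form the expected counts $n_j \hat p_i$, and sum the squared standardized deviations. A numpy sketch (the table values are made up for illustration):

```python
import numpy as np

# Hypothetical I x J contingency table: rows = categories i, columns = groups j.
N = np.array([[30.0, 40.0, 25.0],
              [50.0, 45.0, 60.0],
              [20.0, 15.0, 15.0]])

n_j = N.sum(axis=0)        # group sizes n_j (column totals)
n = N.sum()                # overall sample size n
p_hat = N.sum(axis=1) / n  # pooled estimates \hat p_i under H_0

expected = np.outer(p_hat, n_j)                # expected counts n_j * \hat p_i
stat = ((N - expected) ** 2 / expected).sum()  # the homogeneity statistic

I, J = N.shape
df = (I - 1) * (J - 1)     # degrees of freedom (J-1)(I-1)
print(stat, df)
```

The statistic is then compared to the upper tail of the $\chi^2_{(J-1)(I-1)}$ distribution.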

BEST ANSWER

This can be roughly based on the following steps:

  • For fixed $j$, $$\sum_{i=1}^I \frac{(N_{ij}-n_jp_{ij})^2}{n_jp_{ij}} \stackrel{d}\longrightarrow \chi^2_{I-1} \quad\text{ as }\quad n_j\to \infty \tag{1}$$

  • The above holds independently for every $j$. Therefore summing over $j$,

$$\sum_{j=1}^J\sum_{i=1}^I \frac{(N_{ij}-n_jp_{ij})^2}{n_jp_{ij}} \stackrel{d}\longrightarrow \chi^2_{J(I-1)} \quad\text{ as }\quad n=\sum_{j=1}^Jn_j\to \infty\tag{2}$$

Under $H_0:p_{i1}=\cdots=p_{iJ}=p_i$ (say), $(2)$ is rewritten as

$$\sum_{i=1}^I \sum_{j=1}^J\frac{(N_{ij}-n_jp_i)^2}{n_jp_i} \xrightarrow[H_0]{d} \chi^2_{J(I-1)} \quad\text{ as }\quad n\to \infty$$

  • When $p_i$ is estimated by $\hat p_i$, we lose $I-1$ degrees of freedom:

$$\sum_{i=1}^I \sum_{j=1}^J\frac{(N_{ij}-n_j\hat p_i)^2}{n_j\hat p_i} \xrightarrow[H_0]{d} \chi^2_{J(I-1)-(I-1)}\equiv \chi^2_{(J-1)(I-1)} \quad\text{ as }\quad n\to \infty \tag{3}$$

This is because only $I-1$ of the $p_i$'s are estimated, since $\sum_{i=1}^I p_i=1$.
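A quick way to sanity-check the limiting distribution in $(3)$ is Monte Carlo: simulate the $J$ samples under $H_0$ with a common probability vector, compute the statistic with the pooled $\hat p_i$, and compare the empirical mean of the statistic with the mean $(J-1)(I-1)$ of the limiting $\chi^2$. A numpy sketch (group sizes and probabilities below are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.3, 0.5])   # common cell probabilities under H_0 (I = 3)
n_j = np.array([400, 600])      # group sizes (J = 2), so df = (3-1)(2-1) = 2

reps = 2000
stats = np.empty(reps)
for r in range(reps):
    # N has shape (I, J): column j is a Multinomial(n_j, p) draw
    N = np.column_stack([rng.multinomial(m, p) for m in n_j])
    p_hat = N.sum(axis=1) / n_j.sum()       # pooled estimates \hat p_i
    expected = np.outer(p_hat, n_j)
    stats[r] = ((N - expected) ** 2 / expected).sum()

print(stats.mean())   # should be close to df = 2
```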


I can give you an outline of a derivation of $(1)$. Since $j$ is fixed, we can drop the suffix $j$ here.

Suppose $(N_i)_{1\le i\le I}$ has a $\text{Multinomial}(n;p_1,\ldots,p_I)$ distribution.

Let $\boldsymbol p=(p_i)_{1\le i\le I-1}$. The random vector $\boldsymbol N=(N_i)_{1\le i\le I-1}$ has a nonsingular multinomial distribution (note that $\sum_{i=1}^I N_i=n$) with mean vector $n\boldsymbol p$ and dispersion matrix $n\Sigma$ where $\Sigma=\operatorname{diag}(\boldsymbol p)-\boldsymbol p\boldsymbol p^T$.

By the multivariate central limit theorem, it can be argued that as $n\to \infty$, $\sqrt n\left(\frac{\boldsymbol N}n -\boldsymbol p\right)$ converges in law to an $(I-1)$-dimensional normal distribution with mean vector $\boldsymbol 0$ and dispersion matrix $\Sigma$.

It follows that

$$n\left(\frac{\boldsymbol N}n-\boldsymbol p\right)^T\Sigma^{-1}\left(\frac{\boldsymbol N}n-\boldsymbol p\right)=(\boldsymbol N -n\boldsymbol p )^T (n\Sigma)^{-1} (\boldsymbol N -n\boldsymbol p)\stackrel{d}\longrightarrow \chi^2_{I-1} \quad\text{ as }\quad n\to \infty$$

And by expanding the quadratic form, one can actually show that

$$(\boldsymbol N -n\boldsymbol p )^T (n\Sigma)^{-1} (\boldsymbol N -n\boldsymbol p)=\sum_{i=1}^I \frac{(N_i-np_i)^2}{np_i}$$
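This quadratic-form identity can be verified numerically on a small example (the probabilities and counts below are arbitrary):

```python
import numpy as np

p_full = np.array([0.2, 0.3, 0.5])     # p_1, ..., p_I with I = 3
n = 100
N_full = np.array([18.0, 35.0, 47.0])  # observed counts summing to n

p = p_full[:-1]   # first I-1 probabilities (nonsingular parameterization)
N = N_full[:-1]   # first I-1 counts
Sigma = np.diag(p) - np.outer(p, p)

d = N - n * p
quad = d @ np.linalg.inv(n * Sigma) @ d                        # LHS quadratic form
pearson = ((N_full - n * p_full) ** 2 / (n * p_full)).sum()    # RHS Pearson sum

print(quad, pearson)   # the two agree
```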


One way to think about this problem is to treat the counts as independent Poisson variables, $$N_{ij}\sim\text{Pois}(\lambda_{ij}).$$

With each $N_{ij}$ independent of the others, it can be shown that conditioning on $$\sum_{i,j}N_{ij}=n$$ yields a multinomial distribution with $p=\left\{\frac{\lambda_{km}}{\sum_{i,j}\lambda_{ij}}\right\}_{(k,m)}$.
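The Poisson-conditioning claim can be checked by simulation: draw independent Poisson counts, keep only the draws whose total equals $n$, and compare the conditional cell means with those of the corresponding multinomial. A numpy sketch (the means $\lambda$ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
lam = np.array([2.0, 3.0, 5.0])   # arbitrary Poisson means, summing to 10
total = 10

draws = rng.poisson(lam, size=(200_000, 3))
kept = draws[draws.sum(axis=1) == total]   # condition on the total

# Conditional cell means should match Multinomial(total, lam/lam.sum()),
# whose mean vector is total * lam / lam.sum() = lam here.
print(kept.mean(axis=0), total * lam / lam.sum())
```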

Intuitively it makes sense to set $\lambda_{ij}=np_{ij}$, which yields the appropriate multinomial. Then by the central limit theorem $$\frac{N_{ij}-E[N_{ij}]}{\sqrt{\operatorname{Var}[N_{ij}]}}\;\stackrel{\text{approx.}}{\sim}\; N(0,1).$$ Since this is a Poisson, $E[N_{ij}]=\operatorname{Var}[N_{ij}]=np_{ij}$, so

$$\frac{N_{ij}-np_{ij}}{\sqrt{np_{ij}}}\;\stackrel{\text{approx.}}{\sim}\; N(0,1).$$

We can replace $p_{ij}$ with its MLE $\hat p_{ij} = \frac{n_{i+}n_{+j}}{n^2}$ (where $n_{i+}$ and $n_{+j}$ denote the row and column totals) and rely on strong consistency to preserve the approximation. Squaring and summing these terms gives the $\chi^2$ approximation. The degrees of freedom follow from the constraints on the multinomial, whose total is fixed at $n$.
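Note that the expected count under this estimate, $n \hat p_{ij} = n_{i+}n_{+j}/n$, is exactly the $n_j \hat p_i$ appearing in the homogeneity statistic, so the two formulations produce the same test. A quick numpy check on an arbitrary table:

```python
import numpy as np

N = np.array([[12.0,  7.0, 21.0],
              [ 9.0, 14.0,  6.0]])
n = N.sum()

row = N.sum(axis=1)   # n_{i+}
col = N.sum(axis=0)   # n_{+j}

expected_indep = np.outer(row, col) / n   # n * (n_{i+} n_{+j} / n^2)
expected_homog = np.outer(row / n, col)   # n_{+j} * \hat p_i with \hat p_i = n_{i+}/n

print(np.allclose(expected_indep, expected_homog))   # True
```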