Chi squared test

In the Chi Squared Test we build up a statistic $Q$ which converges in law to a $\chi^2$ as the number $n$ of observations goes to infinity. So, if $n$ is "big enough", we choose to approximate $Q$ with the $\chi^2$.

I don't understand this approximation. How can we deduce information about $Q$ for a given $n$ from its limiting law? The limit depends only on the tail of the sequence, and hence not on $Q$ for any fixed $n$.

Goodness-of-fit statistic. Suppose you want to test whether a die is fair by rolling it 600 times. If the die is fair, you would expect, on average, to see each face $E = 100$ times. If the observed counts for faces $i = 1, \dots, 6$ are $X_i,$ then the chi-squared statistic is

$$Q = \sum_{i=1}^6 \frac{(X_i - E)^2}{E} \stackrel{aprx}{\sim} \mathsf{Chisq}(\nu = 6-1=5),$$ the chi-squared distribution with 5 degrees of freedom.
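As a small worked example of the formula above, here is a sketch in R that computes $Q$ for one set of observed counts. The counts in `x` are made up for illustration and are not data from this answer. The check against `chisq.test` relies on its default behavior for a single vector, which tests against equal cell probabilities.

```r
# Hypothetical observed counts for 600 rolls (made up for illustration)
x <- c(90, 105, 98, 110, 92, 105)
stopifnot(sum(x) == 600)

E <- sum(x) / 6            # expected count per face under a fair die: 100
Q <- sum((x - E)^2 / E)    # chi-squared goodness-of-fit statistic
Q                          # 3.18

# Base R computes the same statistic (default: equal probabilities)
res <- chisq.test(x)
res$statistic
```

Here $Q = 3.18$ is well below the mean of $\mathsf{Chisq}(5),$ which is 5, so these hypothetical counts give no evidence against fairness.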

Test at the 5% level. We reject the null hypothesis that the die is fair at the 5% level of significance if $Q \ge q_c = 11.07,$ where the critical value $q_c$ cuts 5% of the probability from the upper tail of $\mathsf{Chisq}(5).$

qchisq(.95, 5)
[1] 11.0705
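Equivalently, one can compare a P-value with 0.05 instead of comparing $Q$ with the critical value. The sketch below assumes a hypothetical observed value `Q_obs = 12.4`, chosen only for illustration, and shows that the two decision rules agree.

```r
q_c   <- qchisq(0.95, df = 5)     # critical value, about 11.07
Q_obs <- 12.4                     # hypothetical observed statistic (illustration only)

# P-value: upper-tail probability of Chisq(5) beyond the observed statistic
p_val <- pchisq(Q_obs, df = 5, lower.tail = FALSE)

# The two rules give the same decision
c(reject_by_critical_value = Q_obs >= q_c,
  reject_by_p_value        = p_val <= 0.05)
```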

Experience has shown that the approximation is reasonably good in such circumstances provided that $E > 5,$ which is true in our case.

Illustration by simulation. A simulation in R of this situation with a fair die is as below. Because we are simulating rolls of a fair die, we expect to reject in about 5% of the 100,000 iterations. The simulated rejection rate is indeed very nearly 5%.

set.seed(710)  # for reproducibility
m = 10^5       # iterations of the 600-roll experiment
q = replicate( m,
      sum((tabulate(sample(1:6, 600, replace=TRUE))-100)^2/100) )
mean(q > 11.0705)
[1] 0.05101

A histogram of the simulated distribution of $Q$ is a reasonably good fit to the density function of $\mathsf{Chisq}(5).$

hist(q, prob=TRUE, breaks=40, col="skyblue2")
curve(dchisq(x, 5), add=TRUE, n=1001, col="red", lwd=2)

The statistic $Q$ is discrete because values change by small increments as the counts change at random. However, the continuous chi-squared distribution turns out to be a very good approximation to the distribution of $Q$ in the circumstances illustrated.
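The limit theorem by itself says nothing about any fixed $n,$ so in practice the quality of the approximation is checked numerically, as in the simulation above. The following sketch repeats that check for several sample sizes. The sizes 12, 30, 60, 600, the seed, and the helper name `reject_rate` are my own choices for illustration. It uses `rmultinom` to draw the six counts directly rather than tabulating individual rolls.

```r
set.seed(2024)                    # arbitrary seed, for reproducibility
q_c <- qchisq(0.95, df = 5)       # nominal 5% critical value

# Simulated rejection rate of the test for a fair die rolled n times
reject_rate <- function(n, m = 20000) {
  # one column per simulated experiment, six face counts per column
  counts <- rmultinom(m, size = n, prob = rep(1/6, 6))
  E <- n / 6
  Q <- colSums((counts - E)^2 / E)
  mean(Q >= q_c)
}

sizes <- c(12, 30, 60, 600)
rates <- sapply(sizes, reject_rate)
names(rates) <- sizes
rates
```

If the approximation is adequate for a given $n,$ the corresponding rate should be close to the nominal 0.05; the gap between the two is a direct measure of the approximation error at that $n.$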

Power of the test for a biased die. By contrast, if we simulate using a die that is somewhat biased against showing $1$'s (in favor of $6$'s), then we see that the goodness-of-fit test is very likely to reject the null hypothesis that the die is fair. The power of the test is about 97%.

set.seed(1234)         # for reproducibility
m = 10^5               # iterations of the 600-roll experiment
p = c(2,3,3,3,3,4)/18  # probabilities for biased die
q = replicate( m,
      sum((tabulate(sample(1:6, 600, replace=TRUE, prob=p))-100)^2/100) )
mean(q > 11.0705)
[1] 0.97453

Notes: (1) Under the alternative hypothesis that the die is biased with probabilities $p = (2,3,3,3,3,4)/18,$ the statistic $Q$ has approximately the non-central chi-squared distribution with $\nu = 5$ df and noncentrality parameter $\lambda = n\sum_i (p_i - \frac 16)^2/(\frac 16) = 22.22,$ so that the power of the goodness-of-fit test can be computed in R (without simulation) as $0.971.$

1-pchisq(11.0705, 5, 22.22)
[1] 0.9709646
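The value of $\lambda$ in note (1) can be reproduced from its formula rather than typed in. This is a sketch using the same $n = 600$ and biased-die probabilities as above; the variable names are my own.

```r
n  <- 600
p  <- c(2, 3, 3, 3, 3, 4) / 18       # biased-die probabilities from the simulation
p0 <- rep(1/6, 6)                     # fair-die probabilities under the null

lambda <- n * sum((p - p0)^2 / p0)    # noncentrality parameter, equal to 600/27
q_c    <- qchisq(0.95, df = 5)        # 5% critical value of Chisq(5)

# Approximate power: upper tail of the noncentral chi-squared beyond q_c
power <- pchisq(q_c, df = 5, ncp = lambda, lower.tail = FALSE)
c(lambda = lambda, power = power)
```

The result should agree with the simulated power to about two decimal places.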

(2) Rough outline of a proof that $Q$ converges in distribution to $\mathsf{Chisq}(\nu = k-1),$ where $k$ is the number of categories. In general, the $E_i$ are not all necessarily the same. If we view the $X_i$ as Poisson counts, then $Z_i = (X_i - E_i)/\sqrt{E_i}$ is a standardized Poisson random variable with mean 0 and variance 1. For large grand sample size $n,$ the $E_i$ become large and each $Z_i$ converges in distribution to standard normal, so each $Z_i^2$ converges to $\mathsf{Chisq}(1).$ Finally, $Q = \sum_i Z_i^2$ converges to $\mathsf{Chisq}(k-1)$ instead of $\mathsf{Chisq}(k)$ because of the one conditional constraint that $\sum_i X_i = n.$

(3) Essentially same Q&A, except for note (2) and different runs of the simulations.
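A quick sanity check on the $k-1$ in note (2), offered as a sketch rather than a proof. Assume the counts are multinomial with $k$ categories and cell probabilities $p_i,$ so that $E_i = np_i$ and $X_i \sim \mathsf{Binom}(n, p_i).$ Then the mean of $Q$ is exactly $k-1$ for every $n$:

$$\mathbb{E}[Q] = \sum_{i=1}^k \frac{\mathrm{Var}(X_i)}{E_i} = \sum_{i=1}^k \frac{np_i(1-p_i)}{np_i} = \sum_{i=1}^k (1-p_i) = k-1.$$

This matches the mean of $\mathsf{Chisq}(k-1)$ rather than that of $\mathsf{Chisq}(k),$ which is consistent with losing one degree of freedom to the constraint $\sum_i X_i = n.$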