Determining the degrees of freedom for a $\chi$-squared test


I have read that the degrees of freedom are calculated by subtracting $1$ from the number of states a random variable can be in. I am performing a goodness-of-fit test on a $64\times 32$ matrix where the expected frequency of any $a[i,j]$ is $50\,000$ and the observed frequency can lie between $0$ and $100\,000$. What confuses me is how to calculate the degrees of freedom. Since the observed value can range from $0$ to $100\,000$, will the degrees of freedom equal $100\,000-1$? Please advise.


If you are doing a chi-squared goodness-of-fit (GOF) test for data in a matrix with $r$ rows and $c$ columns, and finding the expected count in a cell as (row total)(column total)/(grand total), then $df = (r-1)(c-1).$

Degrees of freedom depend on the numbers of row and column categories, not on the observed and expected counts in the cells.
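As a quick worked instance of that formula, assuming (hypothetically) that the expected counts in a $64 \times 32$ table were computed from the row and column totals:

$$df = (r-1)(c-1) = (64-1)(32-1) = 63 \times 31 = 1953.$$

The possible range of each count ($0$ to $100\,000$) does not enter the calculation at all.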

Note: That said, I have never done a chi-squared GOF test for counts in a matrix anywhere near as large as the one you are talking about. I think you should read about the assumptions of the GOF test and make sure they apply in your situation. If you have doubts, perhaps describe your situation, data, and goals on our sister 'statistics' (or 'crossvalidated') site, and ask whether there is a better way toward your goals. That site tends to get more people with active experience in 'big data' applications.

I'm not saying you are doing the wrong analysis, but something seems to be confusing you, and I'm not sure your simply-resolved question here is the one you really should be asking.

Addendum (posted later, based on information in Comments): I had a look at the paper you linked. It is not exactly entry level material for the main subject matter, which I will not pursue here. However, I think I have a clearer view of what you are trying to do with a statistical test.

Chi-squared test. The chi-squared GOF test you propose is based on $rc = 2048$ $X$-values, each with expectation $E = 50,000.$ For purposes of the test, you essentially ignore the matrix structure because you do not use it to get $E$ (already specified for each cell). Thus, your GOF statistic turns out to be

$$Q = \sum_{i=1}^{rc} \frac{(X_i - E)^2}{E}.$$

Under the null hypothesis that $E(X_i) \equiv E$, the test statistic is approximately $Chisq(rc).$ (The distinction between $df = rc$ and $df = rc - 1$ would hardly matter in practice, but the former is correct because you are not using your $X$-values to estimate $E$, nor using the total of the $X_i$.)

An assumption of the test is that the $X_i$ are independent and approximately normal, so that $Z_i = (X_i - E)/\sqrt{E}$ is approximately standard normal, $Z_i^2 = (X_i - E)^2/E$ is approximately $Chisq(df=1)$, and $Q$ is approximately $Chisq(df=rc).$ Thus one would reject $H_0$ at the 5% level if $Q \ge 2154.4,$ the value that cuts 5% from the upper tail of $Chisq(df = rc)$.
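The test just described can be sketched in R as follows. This is only a minimal illustration: the helper name `gof_fixed_E` is hypothetical (it is not from any package), and the matrix `X` is simulated stand-in data, not the data from the question.

```r
# Sketch of the GOF test with a fixed, pre-specified expected count E.
# gof_fixed_E is a hypothetical helper name, not part of any package.
gof_fixed_E <- function(X, E, alpha = 0.05) {
  X    <- as.vector(X)             # the matrix structure is ignored
  df   <- length(X)                # df = rc: nothing is estimated from the data
  Q    <- sum((X - E)^2 / E)       # test statistic
  crit <- qchisq(1 - alpha, df)    # upper-tail critical value
  list(Q = Q, df = df, crit = crit,
       p.value = pchisq(Q, df, lower.tail = FALSE),
       reject  = Q >= crit)
}

# Illustrative data only: simulated Poisson counts in a 64 x 32 matrix
X   <- matrix(rpois(64 * 32, 50000), nrow = 64, ncol = 32)
res <- gof_fixed_E(X, E = 50000)
res$df      # number of cells, rc
res$crit    # should be close to the 2154.4 quoted above
res$reject  # TRUE in roughly 5% of runs when H0 holds
```

Passing the matrix straight in and flattening it inside the function mirrors the point made above: once $E$ is given for every cell, the row and column layout plays no role in the statistic.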

If the $X_i$ are counts distributed $Pois(\lambda = E),$ then $E(X_i) = E,\;$ $V(X_i) = E,\,$ and $SD(X_i) = \sqrt{E}.$ Certainly, the discrete distribution $Pois(50,000)$ is well approximated by $Norm(50,000, \sqrt{50,000}).$
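One rough way to check the quality of this normal approximation is to compare exact Poisson probabilities with standard normal ones at a few standardized cut points. The cut points in `z` below are arbitrary choices made for illustration.

```r
# Compare exact Poisson(50000) probabilities with the normal approximation.
lambda <- 50000
z      <- c(-1.96, -1, 0, 1, 1.96)     # arbitrary standardized cut points
q      <- lambda + z * sqrt(lambda)    # the same points on the count scale
p_pois <- ppois(q, lambda)             # exact P(X <= q)
p_norm <- pnorm(z)                     # normal approximation, no continuity correction
max_diff <- max(abs(p_pois - p_norm))  # largest discrepancy over the cut points
round(cbind(z, p_pois, p_norm), 4)
max_diff
```

The largest discrepancy should be small compared with the 5% level used in the tests.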

Normal test. A simpler and somewhat similar test (of the null hypothesis that cell means average $E = 50,000$) would use the statistic $Z = (\bar X - E)/\sqrt{E/rc},$ where $\bar X$ is the mean of the $X_i.$ Under the same assumptions as above, $Z$ is approximately standard normal. Thus, one would reject $H_0$ at the 5% level if $|Z| \ge 1.96.$
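A minimal sketch of this normal test in R, under the same assumption that $V(X_i) = E$. The helper name `z_test_fixed_E` is hypothetical, and the example input is made up purely to show the arithmetic.

```r
# Sketch of the two-sided normal test for the average cell count.
# z_test_fixed_E is a hypothetical helper name, not part of any package.
z_test_fixed_E <- function(X, E, alpha = 0.05) {
  X    <- as.vector(X)
  k    <- length(X)                # number of cells, rc
  se   <- sqrt(E / k)              # SD of the sample mean when V(X_i) = E
  Z    <- (mean(X) - E) / se
  crit <- qnorm(1 - alpha / 2)     # about 1.96 for alpha = 0.05
  list(Z = Z, crit = crit,
       p.value = 2 * pnorm(-abs(Z)),
       reject  = abs(Z) >= crit)
}

# Made-up example: every cell is 10 above the expected count
shifted <- z_test_fixed_E(matrix(50010, 64, 32), E = 50000)
shifted$Z        # 10 / sqrt(50000/2048), a little above 2
shifted$reject
```

Note that $Z$ responds only to a shift in the overall average, whereas $Q$ also responds to cell-to-cell scatter that is larger than the Poisson model allows, so the two tests can disagree on the same data.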

The following simulation in R of $m = 10,000$ tests of each type shows that they do have a significance level near 5%, when the $X_i \sim Pois(50,000).\;$ [A larger $m$ would get results a little closer to 5%; but not exactly, because the tests themselves are based on continuous distributions approximating discrete observations.]

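 m = 10000;  Q = Z = numeric(m)            # number of simulated tests; storage
 E = 50000;  k = 64*32;  se = sqrt(E/k)    # expected count; number of cells; SD of mean
 crit = qchisq(.95, k)                     # 5% critical value (named crit to avoid masking c())
 for(i in 1:m) {
   X = rpois(k, E)                         # k Poisson counts under H0
   Q[i] = sum((X-E)^2/E)                   # chi-squared statistic
   Z[i] = (mean(X) - E)/se }               # normal-test statistic

 mean(Q);  sd(Q); mean(Q > crit)
 ## 2046.720  # approx rc = k = 2048
 ## 63.77823  # approx sqrt(2k) = 64
 ## 0.0483    # approx 5% signif level: P(Rej Ho | Ho true)

 mean(Z); sd(Z);  mean(abs(Z) > 1.96)
 ## 0.001443955  # approx E(Z) = 0
 ## 1.005732     # approx SD(Z) = 1
 ## 0.0516       # approx 5% signif level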

The figure below on the left shows simulated values of $Q$ along with the density of $Chisq(df = rc)$; the area to the right of the vertical red line is 5%. On the right are simulated values of $Z$ along with the standard normal density curve; areas outside of the vertical red lines add to 5%.

 par(mfrow = c(1,2))    # 2-panel graph
   hist(Q, prob=TRUE, col="wheat", ylim=c(0,.007))
     curve(dchisq(x, k), col="blue", add=TRUE)
     abline(v = qchisq(.95, k), col="red")
   hist(Z, prob=TRUE, col="skyblue2")
     curve(dnorm(x), col="blue", add=TRUE)
     abline(v = c(-1.96, 1.96), col="red")
 par(mfrow = c(1,1))   # return to default graphs

[Figure: histograms of the simulated $Q$ (left) and $Z$ (right) values with their approximating density curves]