Is Pearson's $\chi^2$-test for independence correct in this case (a table of percentages)?

I found this table:

Type/Income | very low | low | middle low | middle high | high | very high
Food        |   38%    | 34% |    30%     |     28%     | 26%  |   19%
Non-food    |   62%    | 66% |    70%     |     72%     | 74%  |   81%

I found it interesting that, no matter how much income one has, the share spent (as a percentage) on food and non-food products doesn't seem to change significantly. I would like to test this. I used Pearson's $\chi^2$-test and made a table of expected percentages (using the usual method of multiplying the row total by the column total and dividing by the grand total, which is 600%):

Type/Income | very low | low | middle low | middle high | high | very high
Food        |   29%    | 29% |    29%     |     29%     | 29%  |   29%
Non-food    |   71%    | 71% |    71%     |     71%     | 71%  |   71%
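
Spelled out as a check on the arithmetic: the food row sums to $38+34+30+28+26+19 = 175\%$, the non-food row sums to $425\%$, each column sums to $100\%$, and the grand total is $600\%$, so the expected entries (rounded to whole percentages in the table) are

$$E_{\mbox{Food}} = \frac{175 \times 100}{600} \approx 29.2\%, \qquad E_{\mbox{Non-food}} = \frac{425 \times 100}{600} \approx 70.8\%.$$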

The statistic $\chi^2=\sum_{\mbox{cells}} (O-E)^2/E$ gave me $$\chi^2 = 0.10739,$$ which I guess I have to read as $10.739\%$. The critical value at $\alpha=0.05$ with $(n-1)(m-1)=(2-1)(6-1)=5$ degrees of freedom is $\chi_{0.05}^2 = 11.1$, so we fail to reject the null hypothesis and conclude that income has no effect on the percentage spent on food/non-food products.

Is all this reasoning correct? I am used to this kind of test for tables with counts of "individuals" in each category, but it seems reasonable to me to use it for percentages. Are the assumptions reasonable too?

If this is not the right approach, how could I test such a hypothesis?

Thank you very much for any help or information you may have!

There is 1 best solution below

Sorry, but your procedure is not correct. In fact, no correct chi-squared test for independence can be performed if only percentages are available.

The observed and expected quantities used to compute the $\chi^2$ statistic must be counts, not percentages or fractions. Let me illustrate with a simple example:

Suppose we have three columns (levels of one categorical variable, maybe education), two rows (levels of a second categorical variable, maybe party), and we have fractions of people surveyed who prefer Candidate A in the cells of a matrix. Specifically, consider the matrix FRAC below.

FRAC = rbind(c(.1,.2,.3),c(.5,.5,.4));  FRAC
     [,1] [,2] [,3]
[1,]  0.1  0.2  0.3
[2,]  0.5  0.5  0.4

First, let's suppose each cell is based on 30 people. Then matrix MAT1 shows the counts, and a chi-squared test (with P-value > .05) does not show significant evidence of association. [The null hypothesis that categorical variables Education and Party are independent (as to preference for Candidate A) cannot be rejected.]

MAT1 = 30*FRAC;  MAT1
     [,1] [,2] [,3]
[1,]    3    6    9
[2,]   15   15   12
chisq.test(MAT1)

        Pearson's Chi-squared test

data:  MAT1 
X-squared = 3.1973, df = 2, p-value = 0.2022

By contrast, if each cell is based on 100 people, then we have MAT2 of counts, and highly significant evidence of association.

MAT2 = 100*FRAC; MAT2
     [,1] [,2] [,3]
[1,]   10   20   30
[2,]   50   50   40
chisq.test(MAT2)

        Pearson's Chi-squared test

data:  MAT2 
X-squared = 10.6576, df = 2, p-value = 0.00485

Notice that in a chi-squared test the degrees of freedom are based on the number of levels of the two categorical variables, not on the sample size. (Sample size does enter into the formula for the power of the chi-squared test.)
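
The role of sample size can be made explicit with a short derivation. This is only a sketch, and the symbols $n$, $f_{ij}$ and $e_{ij}$ are introduced here for illustration. If every count is $O_{ij} = n f_{ij}$ for a fixed table of fractions $f_{ij}$, then the row totals, column totals and grand total all scale by $n$, so the expected counts are $E_{ij} = n e_{ij}$, where $e_{ij}$ is the expected value computed from the fractions alone. Hence

$$\chi^2 = \sum_{i,j} \frac{(n f_{ij} - n e_{ij})^2}{n e_{ij}} = n \sum_{i,j} \frac{(f_{ij} - e_{ij})^2}{e_{ij}}.$$

The statistic grows in proportion to $n$ while the degrees of freedom stay fixed. This agrees with the two outputs above: $3.1973 \times 100/30 \approx 10.6577$, which matches the value reported for MAT2 up to rounding.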

This is one reason that a bar chart based on percentages is not suitable for publication unless count information is provided in a caption or on the axes. Bar charts for MAT1 and MAT2 would be identical, except for the information on a count axis.

If you are confident that your percentages are based on hundreds of people (rather than a few dozen), then there may be evidence of association, but there is no way to know for sure even in a qualitative sense, and certainly no hope of getting a P-value for a chi-squared test.
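
To see how much the conclusion for your own table depends on the missing counts, here is a minimal sketch in R. It assumes, purely for illustration, that every income group contains the same number `n` of respondents. Your table does not report group sizes, so `n`, the values tried for it, and the helper name `chisq_for_n` are all hypothetical.

```r
# Observed percentages from the question
# (rows: Food, Non-food; columns: the six income levels).
PCT = rbind(c(38, 34, 30, 28, 26, 19),
            c(62, 66, 70, 72, 74, 81))

# ASSUMPTION: each income group has the same number n of respondents.
# The true group sizes are unknown; n is a hypothetical value.
chisq_for_n = function(n) {
  counts = n * PCT / 100   # turn percentages into counts
  chisq.test(counts)       # Pearson's chi-squared test on the counts
}

# Try a few hypothetical group sizes and print the test results.
for (n in c(100, 200, 1000)) {
  res = chisq_for_n(n)
  cat("n per group =", n,
      " X-squared =", round(res$statistic, 3),
      " df =", res$parameter,
      " p-value =", signif(res$p.value, 3), "\n")
}
```

Under this equal-size assumption, `n = 100` gives a statistic of about 10.5 on 5 degrees of freedom, just below the 5% critical value of 11.07, while `n = 1000` gives a statistic of about 105 and overwhelming evidence of association. The same percentages support opposite conclusions, which is why the counts are indispensable.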