Categorical data: Testing difference between two experiments


I have the following experimental setup: Protein A is capable of cutting Protein B into small fragments. The fragments are identified, and the identity of the last amino acid in each fragment is recorded and counted. Thus, in one experiment it is possible to detect all 20 amino acids, but the total count differs between experiments. The total count depends on the nature of Protein A and the conditions of the experiment. In the end, for the two conditions tested, I have a table like this:

Amino-acid  Exp1   Exp2
A             0      3
R            20     12
G            10     15
H            14     22
E             5      0

with entries for all 20 amino acids. I also know the total number of fragments from Protein B that were identified in each condition.

The question I need to answer is: Are the amino-acid frequencies significantly different under the two experimental conditions?

At first I thought of using a chi-square test, since it can take into account the different numbers of fragments identified in the two conditions. But inevitably some of the expected values will be 0, so I cannot use the chi-square test.

Could you please point me in the direction of the test that can be used in this case?

Thanks a lot in advance.

Best answer

You seem to count occurrences of five amino acids under two sets of conditions. To do a chi-squared test of homogeneity (null hypothesis: each amino acid has the same probability of occurring under both conditions), you can find the chi-squared statistic

$$Q = \sum_{i=1}^2 \sum_{j=1}^5 \frac{(X_{ij} - E_{ij})^2}{E_{ij}},$$ where $i$ designates experiment and $j$ amino acid, each $X_{ij}$ is an observed count, and each $E_{ij}$ is the total for experiment $i$ times the total for amino acid $j$ divided by the grand total of all ten counts. For example, experiment 1 has total 49, amino acid A has total 3, and the grand total is 101, so $E_{11} = 49(3)/101 = 1.455.$
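The formula above can be checked by hand before calling `chisq.test`. The following is a minimal sketch in base R; the names `counts`, `row_tot`, `col_tot`, `grand` and `expected` are illustrative choices, not part of the original answer.

```r
# Counts from the question: rows are experiments, columns are amino acids.
counts <- matrix(c(0, 20, 10, 14, 5,
                   3, 12, 15, 22, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Exp1", "Exp2"),
                                 c("A", "R", "G", "H", "E")))

row_tot <- rowSums(counts)   # total count per experiment
col_tot <- colSums(counts)   # total count per amino acid
grand   <- sum(counts)       # grand total of all ten counts

# E_ij = (total for experiment i) * (total for amino acid j) / (grand total)
expected <- outer(row_tot, col_tot) / grand

# Pearson chi-squared statistic Q
Q <- sum((counts - expected)^2 / expected)

round(expected, 3)
round(Q, 2)
```

The matrix `expected` should agree with the `$expected` component that `chisq.test` returns for the same table.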

Here is the data matrix with each experiment in a row.

MAT = matrix(c( 0, 20, 10, 14, 5,
                3, 12, 15, 22, 0), nrow=2, byrow=TRUE)

MAT
     [,1] [,2] [,3] [,4] [,5]
[1,]    0   20   10   14    5
[2,]    3   12   15   22    0

Here are the $E_{ij}:$

ChisqOut = chisq.test(MAT);  ChisqOut$expected
Warning message:
In chisq.test(MAT) : Chi-squared approximation may be incorrect
         [,1]     [,2]     [,3]     [,4]     [,5]
[1,] 1.455446 15.52475 12.12871 17.46535 2.425743
[2,] 1.544554 16.47525 12.87129 18.53465 2.574257

If all of the $E_{ij}$ exceeded $5,$ then under the null hypothesis that the two experiments produce the same distribution of amino acid counts, the chi-squared statistic $Q$ would have approximately a chi-squared distribution with $(r-1)(c-1) = (2-1)(5-1) = 4$ degrees of freedom. The warning message is triggered because the expected counts for amino acids A and E are too small. However, R can approximate the actual null distribution of $Q$ by simulation, sampling random tables with the same row and column totals as the data. This makes it possible to do a test anyhow, even though the 'chi-squared statistic' is not even approximately 'chi-squared distributed' here:

chisq.test(MAT, simulate.p.value=TRUE)

        Pearson's Chi-squared test with simulated p-value (based on 2000 replicates)

data:  MAT
X-squared = 12.7, df = NA, p-value = 0.01284

(A couple more tries yielded similar simulated P-values.) Thus it seems that we can reject the null hypothesis of homogeneity at the 2% level of significance (and hence at the 5% level), though not quite at the 1% level.
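Because the simulated P-value changes from run to run, it can help to fix the random seed and use more replicates than the default 2000. The sketch below assumes an arbitrary seed and `B = 10000`; both values are illustrative choices, not taken from the original answer.

```r
# Same counts as in the question: rows are experiments, columns are amino acids.
counts <- matrix(c(0, 20, 10, 14, 5,
                   3, 12, 15, 22, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Exp1", "Exp2"),
                                 c("A", "R", "G", "H", "E")))

set.seed(2024)   # arbitrary seed, only to make the run reproducible

# Monte Carlo P-value based on B simulated tables (B = 10000 is an assumption)
sim_test <- chisq.test(counts, simulate.p.value = TRUE, B = 10000)

sim_test$statistic   # the observed value of Q
sim_test$p.value     # simulated P-value
```

More replicates reduce the run-to-run variation of the P-value; they do not change the observed statistic $Q$.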

Ordinarily, when the null hypothesis is rejected one looks at the 'Pearson residuals' in each of the $rc = 10$ cells seeking residuals greater than about 2 in absolute value, thus pointing to particular data cells of interest as contributing markedly to the significant result. But there are no such residuals here:

ChisqOut$residuals
          [,1]      [,2]       [,3]       [,4]      [,5]
[1,] -1.206418  1.135807 -0.6112371 -0.8291977  1.652835
[2,]  1.171101 -1.102557  0.5933434  0.8049232 -1.604449

As one might suspect from the positions of the 0's for amino acids A and E, the largest components in the sum $Q$ come from those amino acids. Because you have so little data on these two amino acids, I am reluctant to encourage you to speculate on whether they really do behave differently under your two experimental conditions.
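The contribution of each amino acid to $Q$ is the sum of its two squared Pearson residuals, so the statement above can be verified directly. This is a small sketch; the names `fit` and `contrib` are illustrative.

```r
# Same counts as in the question: rows are experiments, columns are amino acids.
counts <- matrix(c(0, 20, 10, 14, 5,
                   3, 12, 15, 22, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Exp1", "Exp2"),
                                 c("A", "R", "G", "H", "E")))

# The small-expected-count warning is expected here, so it is suppressed.
fit <- suppressWarnings(chisq.test(counts))

# Each cell contributes (Pearson residual)^2 to Q; sum over the two experiments.
contrib <- colSums(fit$residuals^2)

round(contrib, 2)                  # contribution of each amino acid to Q
round(contrib / sum(contrib), 2)   # the same, as a share of Q
```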

One common 'cure' for too-small values of $E_{ij}$ is to combine categories. Perhaps combine amino acids A & R and H & E, but I don't know enough about your experiment to contemplate whether this makes any sense. (Maybe there are amino acids that are in some way 'similar' so that combining small-count ones with larger-count ones would make sense.)
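As a sketch of how such pooling could be done in R, the code below merges A with R and H with E, exactly as floated above. This grouping is only an illustration of the mechanics; whether it makes biochemical sense is a separate question, and the name `pooled` is an assumption of this example.

```r
# Same counts as in the question: rows are experiments, columns are amino acids.
counts <- matrix(c(0, 20, 10, 14, 5,
                   3, 12, 15, 22, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Exp1", "Exp2"),
                                 c("A", "R", "G", "H", "E")))

# Pool columns: A with R, and H with E (illustrative grouping only).
pooled <- cbind("A+R" = counts[, "A"] + counts[, "R"],
                "G"   = counts[, "G"],
                "H+E" = counts[, "H"] + counts[, "E"])

pooled_test <- chisq.test(pooled)

pooled_test$expected   # check that every expected count now exceeds 5
pooled_test            # ordinary chi-squared test on the 2 x 3 table
```

Pooling reduces the table to 2 rows and 3 columns, so the test has $(2-1)(3-1) = 2$ degrees of freedom. It also hides any difference that is specific to A or E alone, so the result answers a coarser question than the original table does.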

As is often the case, it would be helpful if you had more data: more 'fragments' in your experiments, and thus larger expected counts and greater assurance in drawing particular conclusions of interest.