We are given that the members of a community are classified by blood type according to the following schema:
\begin{array}{|c|c|c|c|c|} \hline O& A & B & AB & Total \\ \hline 121 & 120 & 79 & 33 & 353\\ \hline \end{array}
We are also given that the probabilities of the above blood types depend on gene frequency parameters $r, p, q$ satisfying the following relations:
$$ r + p + q = 1 $$
$$ P(\text{O}) = r^2 $$
$$ P(\text{A}) = p^2 + 2pr $$
$$ P(\text{B}) = q^2 + 2qr $$
$$ P(\text{AB}) = 2pq $$

Finally, we are given the MLEs of $p, q, r$, which are:

$$ \hat{r} = 0.580 $$
$$ \hat{p} = 0.246 $$
$$ \hat{q} = 0.173 $$

My question is: what is the approach to testing whether the given community fits the theory? Typically the approach is to use a $\chi^2$ test for goodness of fit, but I cannot see how it would apply in this scenario. Any suggestions will be deeply appreciated.
UPDATE: It was brought to my attention that I am not exactly stating what aspect of the test I am not understanding. Here is that information:
- I am assuming that $H_1$ in this case will be that the proposed theory fits the data and $H_0$ that it does not.
- There does not seem to be a preferred level of significance, so I guess 5% should be acceptable (if there is motivation to choose a different one, please let me know).
- The degrees of freedom will be 3(?)
- For expected frequencies I should have e.g. $O \times P(\text{O}) = 121 \times 0.580^2$ and so on.
- The observed data is given and the test statistic is easy to compute.
Is the summary I have provided the way to tackle this or am I missing some aspect of the problem?
UPDATE 2: It was mentioned to me that the degrees of freedom should be less than 3 because of the MLEs. It is not clear to me how this happens exactly. Could someone elaborate on this aspect of the problem?
FINAL UPDATE: I think that considering the comments in here and my sketch of the approach suffices to solve the problem. My thanks to everyone who offered their expertise!
Your null hypothesis is that Hardy-Weinberg equilibrium has been reached for blood types in the population from which the blood samples were drawn.
You have observed counts $\#O = 121,\,$ $\#A = 120,\,$ $\#B = 79,\,$ and $\#AB = 33.$ At equilibrium, the probabilities are as shown in your question, and the expected counts are obtained by evaluating those probabilities at the MLEs and multiplying by the sample size $n = 353.$
The respective expected counts are
$$\begin{aligned}
E_O &= n\hat r^2 = 353(0.580)^2 = 118.7492, \\
E_A &= n(\hat p^2 + 2\hat p \hat r) = 353\left(0.246^2 + 2(0.246)(0.580)\right) = 122.0942, \\
E_B &= n(\hat q^2 + 2\hat q \hat r) = 353\left(0.173^2 + 2(0.173)(0.580)\right) = 81.4050, \\
E_{AB} &= n(2 \hat p \hat q) = 353\left(2(0.246)(0.173)\right) = 30.0459.
\end{aligned}$$
As a check, we verify that the expected counts sum to 353 (within rounding error): $118.7492+122.0942+81.4050+30.0459 = 352.2943.$ The small shortfall arises because the rounded MLEs sum to $0.999$ rather than $1$, so the four fitted probabilities sum to $0.999^2 \approx 0.998.$
The chi-squared goodness-of-fit statistic is $$Q = \sum_{i=1}^4 \frac{(X_i - E_i)^2}{E_i},$$ where the observed counts are denoted $X_i$.
Evaluating this sum (for example in R) gives $Q \approx 0.4401$.
Under the null hypothesis the approximate distribution of $Q$ is $Q \sim \mathrm{Chisq}(\mathrm{df} = 1).$ The rationale for having only one degree of freedom is that two free parameters were estimated from the data: the frequencies $p$ and $q$ of the A and B alleles (the third, $r = 1 - p - q$, is then determined). Without estimation we would have had $\mathrm{df} = 4-1 = 3,$ but we lose one degree of freedom for each parameter estimated, leaving $\mathrm{df} = 3 - 2 = 1.$
The P-value of the test is the probability $0.507$ under the $\mathrm{Chisq}(1)$ density curve to the right of $0.4401$. This is far above the 5% level, so the null hypothesis is not rejected: the agreement of observed counts with expected counts is good enough to be considered consistent with equilibrium.
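The calculation above can be reproduced with a short script. The following is a minimal Python sketch using only the standard library (the R code behind the original computation is not shown). The variable names and the helper `chisq1_sf` are illustrative, and the helper relies on the identity $P(\chi^2_1 > x) = \operatorname{erfc}\left(\sqrt{x/2}\right)$, which holds only for one degree of freedom. The sketch uses the rounded MLEs from the question, with $\hat p = 0.246$ in every cell.

```python
import math

# Observed counts from the question, keyed by blood type.
observed = {"O": 121, "A": 120, "B": 79, "AB": 33}
n = sum(observed.values())  # total sample size, 353

# MLEs quoted in the question (rounded to three decimals).
r_hat, p_hat, q_hat = 0.580, 0.246, 0.173

# Hardy-Weinberg cell probabilities evaluated at the MLEs.
probs = {
    "O": r_hat**2,
    "A": p_hat**2 + 2 * p_hat * r_hat,
    "B": q_hat**2 + 2 * q_hat * r_hat,
    "AB": 2 * p_hat * q_hat,
}

# Expected counts under the null hypothesis.
expected = {k: n * probs[k] for k in observed}

# Pearson goodness-of-fit statistic.
Q = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)


def chisq1_sf(x):
    """Upper-tail probability P(X > x) for X ~ chi-squared with 1 df.

    Uses X = Z^2 with Z standard normal, so P(X > x) = erfc(sqrt(x / 2)).
    This shortcut is specific to df = 1 (hypothetical helper name).
    """
    return math.erfc(math.sqrt(x / 2.0))


# df = (4 cells - 1) - 2 estimated parameters = 1
p_value = chisq1_sf(Q)

for k in observed:
    print(f"{k:>2}: observed = {observed[k]:3d}, expected = {expected[k]:.4f}")
print(f"sum of expected counts = {sum(expected.values()):.4f}")
print(f"Q = {Q:.4f}, p-value = {p_value:.3f}")
```

For other degrees of freedom a general chi-squared tail function would be needed; the closed form here was chosen only to keep the sketch free of third-party dependencies.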
Below is a plot of the PDF of $\mathrm{Chisq}(1)$ with a vertical dotted line at the observed value of the GOF statistic. (This distribution is strongly right-skewed, with a long right tail that holds more probability at higher values than may be apparent from the plot.)
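To put numbers on the remark about the tail, the density and upper-tail probability of the chi-squared distribution with one degree of freedom can be tabulated at a few points. This is a rough Python sketch using only the standard library. The function names `chisq1_pdf` and `chisq1_sf` are illustrative, and both formulas are specific to one degree of freedom.

```python
import math


def chisq1_pdf(x):
    """Density of chi-squared with 1 df: exp(-x/2) / sqrt(2*pi*x), for x > 0."""
    return math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi * x)


def chisq1_sf(x):
    """Upper-tail probability P(X > x) for chi-squared with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


# Evaluation points chosen for illustration; 3.841 and 6.635 are the
# usual 5% and 1% critical values for 1 df.
points = (0.01, 0.1, 0.5, 1.0, 2.0, 3.841, 6.635)

print(f"{'x':>6} {'density':>10} {'P(X > x)':>10}")
for x in points:
    print(f"{x:6.3f} {chisq1_pdf(x):10.4f} {chisq1_sf(x):10.4f}")
```

The density is very large near zero and falls off quickly, yet roughly a third of the probability still lies to the right of $x = 1$, which is why a plot dominated by the spike near zero can understate the tail.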