Test if the theory fits the data


We are given that the members of a community are classified by blood type according to the following schema:

\begin{array}{|c|c|c|c|c|} \hline O& A & B & AB & Total \\ \hline 121 & 120 & 79 & 33 & 353\\ \hline \end{array}

We are also given that the probabilities of the above blood types depend on gene frequency parameters $r, p, q$ satisfying the following relations:

$$ r + p + q = 1 $$
$$ P(\text{O}) = r^2 $$
$$ P(\text{A}) = p^2 + 2pr $$
$$ P(\text{B}) = q^2 + 2qr $$
$$ P(\text{AB}) = 2pq $$

Finally, we are given the MLEs of $p, q, r$, which are:

$$ \hat{r} = 0.580 $$
$$ \hat{p} = 0.246 $$
$$ \hat{q} = 0.173 $$

My question is: what is the approach to testing whether the given community fits the theory? Typically the approach is to use a $\chi^2$ test for goodness of fit, but I cannot see how it would apply in this scenario. Any suggestions will be deeply appreciated.

UPDATE: It was brought to my attention that I am not exactly stating what aspect of the test I am not understanding. Here is that information:

  1. I am assuming that $H_1$ in this case will be that the proposed theory fits the data and $H_0$ that it does not.
  2. There does not seem to be a preferred level of significance, so I guess 5% should be acceptable (if there is motivation to choose a different one, please let me know).
  3. The degrees of freedom will be 3(?)
  4. For expected frequencies I should have e.g. $O \times P(\text{O}) = 121 \times 0.580^2 $ and so on.
  5. The observed data is given and the test statistic is easy to compute.

Is the summary I have provided the way to tackle this or am I missing some aspect of the problem?

UPDATE 2: It was mentioned to me that the degrees of freedom should be less than 3 because of the MLEs. It is not clear to me how this happens exactly. Could someone elaborate on this aspect of the problem?

FINAL UPDATE: I think that considering the comments in here and my sketch of the approach suffices to solve the problem. My thanks to everyone who offered their expertise!

On BEST ANSWER

Your null hypothesis is that Hardy-Weinberg equilibrium has been reached for blood types in the population from which the blood samples were drawn.

You have observed counts $\#O = 121,\,$ $\#A = 120,\,$ $\#B = 79,\,$ and $\#AB = 33,$ for a total of $n = 353.$ At equilibrium, the probabilities are as shown in your question, and the expected counts are obtained by evaluating those probabilities at the MLEs and multiplying by $n$.

The respective expected counts are $E_O = n\hat r^2 = 353(.580)^2 = 118.7492,\,$ $E_A = n(\hat p^2 + 2\hat p \hat r) = 353(.246^2 + 2(.246)(.580)) = 122.0942,\,$ $E_B = n(\hat q^2 + 2\hat q \hat r) = 353(.173^2 + 2(.173)(.580)) = 81.4050,\,$ and $E_{AB} = n(2 \hat p \hat q) = 353(2(.246)(.173)) = 30.0459.$

As a check, we verify that the expected counts sum to approximately 353: $118.7492+122.0942+81.4050+30.0459 = 352.2943.$ The small shortfall arises because the rounded MLEs sum to $0.999$ rather than $1.$

The chi-squared goodness-of-fit statistic is $$Q = \sum_{i=1}^4 \frac{(X_i - E_i)^2}{E_i},$$ where the observed counts are denoted $X_i$.

Using R, we find that $Q = 0.4401$.

    obs = c(121, 120, 79, 33)
    expected = c(118.7492, 122.0942, 81.4050, 30.0459)
    q = sum((obs - expected)^2/expected); round(q, 4)
    ## 0.4401

Under the null hypothesis, $Q$ is approximately distributed as $Chisq(df = 1).$ There is only one degree of freedom because the MLEs come from estimating two free parameters from the data, namely the frequencies $p$ and $q$ of the A and B alleles (the third, $r = 1 - p - q,$ is then determined). Without estimation we would have had $df = 4-1 = 3,$ but we lose one degree of freedom for each independently estimated parameter, leaving $df = 3 - 2 = 1.$

The P-value of the test is the probability $0.507$ under the $Chisq(1)$ density curve to the right of $0.4401$. This is far above the 5% level, so the agreement of observed counts with expected counts is good enough to be considered consistent with equilibrium.
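As a cross-check, the whole test can be reproduced in R starting from the MLEs rather than from typed-in expected counts. This is only a minimal sketch: the variable names are my own, and it uses the MLEs exactly as stated in the question (rounded to three places), so its output can differ slightly from figures based on hand-computed expected counts.

```r
# Observed counts in the order O, A, B, AB
obs <- c(O = 121, A = 120, B = 79, AB = 33)
n <- sum(obs)

# MLEs of the allele frequencies, as given in the question (rounded)
r_hat <- 0.580
p_hat <- 0.246
q_hat <- 0.173

# Hardy-Weinberg cell probabilities evaluated at the MLEs
prob <- c(O  = r_hat^2,
          A  = p_hat^2 + 2 * p_hat * r_hat,
          B  = q_hat^2 + 2 * q_hat * r_hat,
          AB = 2 * p_hat * q_hat)

expected <- n * prob                       # expected counts
Q <- sum((obs - expected)^2 / expected)    # chi-squared GOF statistic

# 4 cells, minus 1 for the fixed total, minus 2 free parameters estimated (p, q)
dof <- length(obs) - 1 - 2

# Upper-tail probability under Chisq(dof)
p_value <- pchisq(Q, df = dof, lower.tail = FALSE)

round(c(Q = Q, df = dof, p_value = p_value), 4)
```

Note that `chisq.test(obs, p = prob, rescale.p = TRUE)` would report a P-value based on 3 degrees of freedom, because it does not know that two parameters were estimated; that is why the tail probability is computed with `pchisq` directly here.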

Below is a plot of the PDF of $Chisq(1)$ with a vertical dotted line at the observed value of the GOF statistic. (This is a 'heavy-tailed' distribution with more probability towards higher values than may be apparent from the plot.)

[Figure: PDF of $Chisq(1)$ with a vertical dotted line at the observed value of the GOF statistic]
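A plot of this kind can be drawn roughly as follows. This is a sketch under my own choices of plotting range, labels, and variable names; the statistic is recomputed from the counts and the MLEs in the question so that the block runs on its own.

```r
# Recompute the observed statistic from the data and the MLEs in the question
obs   <- c(121, 120, 79, 33)
r_hat <- 0.580
p_hat <- 0.246
q_hat <- 0.173
prob  <- c(r_hat^2,
           p_hat^2 + 2 * p_hat * r_hat,
           q_hat^2 + 2 * q_hat * r_hat,
           2 * p_hat * q_hat)
expected <- sum(obs) * prob
q_obs <- sum((obs - expected)^2 / expected)

# Density of Chisq(1); start just above 0 because the density is unbounded there
curve(dchisq(x, df = 1), from = 0.01, to = 6, n = 1001,
      xlab = "q", ylab = "Density", main = "PDF of Chisq(1)")
abline(v = q_obs, lty = "dotted", col = "red")  # observed GOF statistic
abline(h = 0, col = "gray")

# Upper-tail areas beyond 1, 2 and 4, to quantify the right tail
tail_area <- pchisq(c(1, 2, 4), df = 1, lower.tail = FALSE)
tail_area
```

The `tail_area` values show how much probability still lies to the right of points where the curve already looks close to the axis.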