Hypergeometric Hypothesis Testing

Suppose I have a jar with 9000 balls, each ball is either black or red. I pull a sample of 6000 and observe that 53% (3180) are red. I want to conduct a hypothesis test where $H_{0}:=$ Less than 50% of balls in the jar are red.

However, I am willing to alter that hypothesis if there is a more reasonable way to formulate it that will 'essentially mean the same thing'. I have done some research to try and figure out the best way to go about this, and I discovered the hypergeometric test for over/under representation.

The hypergeometric test uses the hypergeometric distribution to measure the statistical significance of having drawn a sample consisting of a specific number of $k$ successes (out of $n$ total draws) from a population of size $N$ containing $K$ successes. In a test for over-representation of successes in the sample, the hypergeometric $p$-value is calculated as the probability of randomly drawing $k$ or more successes from the population in $n$ total draws. In a test for under-representation, the $p$-value is the probability of randomly drawing $k$ or fewer successes.

According to this, I should take $N = 9000, k = 3180, n = 6000, K < 4500$ and then if $f(k,n)$ is the PDF of a hypergeometric distribution with $N = 9000$ and $K < 4500$, I find my $p$-value as $$P = \sum_{i = 3180}^{6000} f(i,6000).$$

Does this make sense? How do I handle the fact that $K < 4500$? Should I do the summation for each value of $K < 4500$, or would it make sense to set $K = 4500$? Does my setup accurately reflect the hypothesis I set out to test? Should I alter my approach, or is there perhaps a better hypothesis I could formulate?

I have almost no statistics background, just one or two classes when I was in undergrad, so I am not only lacking the ability to set up and solve this problem, but also need help interpreting the results and meaning. Thanks!

Best answer

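First, the exact method of your displayed equation can be used to get the P-value of the test of $H_0: p = .5$ against $H_a: p > .5,$ where $p$ is the proportion of red balls in the urn. Setting $K = 4500$ is enough to cover your null hypothesis of at most 50% red: for any smaller number of red balls, the probability of seeing 3180 or more red in the sample is smaller still, so the boundary case gives the largest P-value.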

To be specific, let the sample size (without replacement) be $n = 6000,$ the total number of balls in the urn be $N = 9000,$ of which $r$ are red and $b = N-r$ are black. The null hypothesis is $H_0: p = r/N = .5$ and the alternative is $H_a: p > .5.$
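
Written out in this notation, the quantity being computed would look as follows. This is only a restatement of the standard hypergeometric formula with the numbers from the question plugged in; the summation index $x$ is my own choice of symbol. The sum stops at 4500 because the sample cannot contain more red balls than the urn does, so the terms from 4501 to 6000 in your displayed sum are all zero.

$$P(X = x) = \frac{\binom{r}{x}\binom{b}{n-x}}{\binom{N}{n}}, \qquad \text{P-value} = P(X \ge 3180) = \sum_{x=3180}^{4500} \frac{\binom{4500}{x}\binom{4500}{6000-x}}{\binom{9000}{6000}}.$$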

Then in R (where phyper is a hypergeometric CDF), the P-value $\approx 0$ can be computed as

n = 6000; r = b = 4500       # sample size; red and black balls under H0
1 - phyper(3179, r, b, n)    # P(X >= 3180) = 1 - P(X <= 3179)
[1] 4.440892e-16
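
A possibly more robust way to get the same tail probability is to ask `phyper` for the upper tail directly. A result of this size sits at the limit of what `1 - phyper(...)` can resolve in double precision, because it subtracts a number extremely close to 1 from 1. The sketch below also tries a few smaller numbers of red balls to illustrate why the boundary case $r = 4500$ is the one to test; the values in `r_alt` and the names `p_boundary` and `p_smaller` are my own choices for illustration.

```r
n <- 6000; N <- 9000          # sample size and urn size from the question
r <- 4500; b <- N - r         # boundary of the null: exactly half red

# P(X >= 3180) computed directly as an upper tail, avoiding 1 - (value near 1)
p_boundary <- phyper(3179, r, b, n, lower.tail = FALSE)

# Same tail probability for a few smaller numbers of red balls (assumed values)
r_alt <- c(4400, 4450, 4499)
p_smaller <- phyper(3179, r_alt, N - r_alt, n, lower.tail = FALSE)

print(p_boundary)
print(p_smaller)              # each should be no larger than p_boundary
```

If the values in `p_smaller` all come out below `p_boundary`, that matches the idea that rejecting at $r = 4500$ also rejects every $r < 4500$.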

A plot of the relevant bars in the PDF of the null hypergeometric distribution is shown below. Notice that the observed value $X = 3180$ is off the graph to the right.

x = 2900:3100;  pdf = dhyper(x, r,b, n)
plot(x, pdf, type="h", ylab="PDF", 
     main="Null Hypergeometric Distribution")
abline(h=0, col="green2")

[Figure: bar plot of the null hypergeometric PDF for $x$ from 2900 to 3100, titled "Null Hypergeometric Distribution"]

With population and sample sizes as in your problem, the null hypergeometric distribution is well approximated by a normal distribution with $\mu = np = 3000$ and standard deviation $\sigma = \sqrt{np(1-p)\frac{N-n}{N-1}} \approx 22.4,$ where the factor $\frac{N-n}{N-1}$ is the 'finite population correction' (see the variance formula in the Wikipedia article on the hypergeometric distribution). This correction is especially important when, as here, the sample size (without replacement) is more than 10% of the population size. Thus a serviceable approximation to the P-value can be found using a normal distribution.
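
As a rough sketch of that normal approximation in R, using the numbers from the question: the continuity correction of 0.5 is my own addition and not something the approximation requires, and the variable names are chosen only for illustration.

```r
N <- 9000; n <- 6000; r <- 4500   # urn size, sample size, red balls under H0
p0 <- r / N                       # null proportion of red balls

mu    <- n * p0                          # null mean, 3000
fpc   <- (N - n) / (N - 1)               # finite population correction
sigma <- sqrt(n * p0 * (1 - p0) * fpc)   # null standard deviation

# Upper-tail normal approximation for X >= 3180, with continuity correction
z <- (3180 - 0.5 - mu) / sigma
p_normal <- pnorm(z, lower.tail = FALSE)

print(c(mu = mu, sigma = sigma, z = z, p_normal = p_normal))
```

The observed count lies roughly eight standard deviations above the null mean, which is why the P-value is essentially zero whichever method is used.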

[Note: If you are going to use software to compute hypergeometric probabilities, you need to make sure the program is written with sufficient care to avoid 'overflows' with the large factorials involved. See computational notes in the R documentation.]
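
To illustrate the overflow issue, here is one way the tail sum could be computed by hand on the log scale. This is only a sketch under my own assumptions, not a description of how `phyper` works internally; `log_pmf` is a hypothetical helper name.

```r
N <- 9000; n <- 6000; r <- 4500; b <- N - r

print(choose(N, n))   # Inf: the coefficient itself exceeds double range

# Log of the hypergeometric PDF, built from log binomial coefficients
log_pmf <- function(x) lchoose(r, x) + lchoose(b, n - x) - lchoose(N, n)

x  <- 3180:min(n, r)  # support points in the upper tail
lp <- log_pmf(x)

# Log-sum-exp: factor out the largest term before exponentiating
log_p <- max(lp) + log(sum(exp(lp - max(lp))))

print(exp(log_p))                                  # tail probability by hand
print(phyper(3179, r, b, n, lower.tail = FALSE))   # built-in, for comparison
```

Working with `lchoose` keeps every intermediate value in a range that double precision can represent, and the two printed tail probabilities should agree closely.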