Derivation of mean and variance of Hypergeometric Distribution


I need a clear and detailed derivation of the mean and variance of the hypergeometric distribution.

If a box contains $N$ balls, $a$ of which are black and $N-a$ white, and $n$ balls are drawn at random without replacement, then the probability of getting $x$ black balls (and hence $n-x$ white balls) is given by the following p.m.f.

The p.m.f is $$f(x) =\frac{(_{a}C_x) \cdot (_{N-a}C_{n-x})}{_{N}C_n} $$

The mean is given by: $$ \mu = E(x) = np = na/N$$ and the variance by $$ \sigma^2 = E(x^2)-[E(x)]^2 = \frac{na(N-a)(N-n)}{N^2(N-1)} = npq \left[\frac{N-n}{N-1}\right] $$ where $$ q = 1-p = (N-a)/N$$

I want the step by step procedure to derive the mean and variance. Thank you.
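Before working through a derivation, the stated formulas can be checked numerically. Here is a small Python sketch (the parameter values $N=20$, $a=7$, $n=5$ are arbitrary) that computes the mean and variance directly from the p.m.f. using exact rational arithmetic:

```python
from fractions import Fraction
from math import comb

def pmf(x, N, a, n):
    # P(X = x): x black balls in n draws without replacement
    # from N balls, a of them black.
    return Fraction(comb(a, x) * comb(N - a, n - x), comb(N, n))

N, a, n = 20, 7, 5  # arbitrary example parameters
mean = sum(x * pmf(x, N, a, n) for x in range(n + 1))
ex2 = sum(x * x * pmf(x, N, a, n) for x in range(n + 1))
var = ex2 - mean ** 2

# Compare with the closed forms na/N and na(N-a)(N-n) / (N^2 (N-1)).
assert mean == Fraction(n * a, N)
assert var == Fraction(n * a * (N - a) * (N - n), N * N * (N - 1))
```

Because `Fraction` arithmetic is exact, the assertions confirm the closed forms exactly rather than up to floating-point error.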

There are 3 answers below.

BEST ANSWER

This is a rather old question, but it is worth revisiting this computation. Let $$\Pr[X = x] = \frac{\binom{m}{x} \binom{N-m}{n-x}}{\binom{N}{n}},$$ where I have used $m$ instead of $a$. We can ignore the details of specifying the support if we use the convention that out-of-range binomial coefficients evaluate to zero; e.g., $\binom{n}{k} = 0$ if $k \not\in \{0, \ldots, n\}$.

Then we observe the identity $$x \binom{m}{x} = \frac{m!}{(x-1)!(m-x)!} = \frac{m(m-1)!}{(x-1)!((m-1)-(x-1))!} = m \binom{m-1}{x-1},$$ whenever both binomial coefficients exist. Thus $$x \Pr[X = x] = m \frac{\binom{m-1}{x-1} \binom{(N-1)-(m-1)}{(n-1)-(x-1)}}{\frac{N}{n}\binom{N-1}{n-1}},$$ and we see that $$\operatorname{E}[X] = \frac{mn}{N} \sum_x \frac{\binom{m-1}{x-1} \binom{(N-1)-(m-1)}{(n-1)-(x-1)}}{\binom{N-1}{n-1}}.$$ The sum is simply the total probability of a hypergeometric distribution with parameters $N-1$, $m-1$, $n-1$, and so equals $1$. Therefore the expectation is $\operatorname{E}[X] = mn/N$.

To get the second moment, consider $$x(x-1)\binom{m}{x} = m(x-1)\binom{m-1}{x-1} = m(m-1) \binom{m-2}{x-2},$$ which is just an iteration of the first identity we used. Consequently $$x(x-1)\Pr[X = x] = \frac{m(m-1)\binom{m-2}{x-2}\binom{(N-2)-(m-2)}{(n-2)-(x-2)}}{\frac{N(N-1)}{n(n-1)}\binom{N-2}{n-2}},$$ and again by the same reasoning, we find $$\operatorname{E}[X(X-1)] = \frac{m(m-1)n(n-1)}{N(N-1)}.$$

It is now quite easy to see that the "factorial moment" satisfies $$\operatorname{E}[X(X-1)\cdots(X-k+1)] = \prod_{j=0}^{k-1} \frac{(m-j)(n-j)}{N-j}.$$ In fact, we can write this in terms of binomial coefficients as well: $$\operatorname{E}\left[\binom{X}{k}\right] = \frac{\binom{m}{k} \binom{n}{k}}{\binom{N}{k}}.$$

This gives us a way to recover raw and central moments; e.g., $$\operatorname{Var}[X] = \operatorname{E}[X^2] - \operatorname{E}[X]^2 = \operatorname{E}[X(X-1) + X] - \operatorname{E}[X]^2 = \operatorname{E}[X(X-1)] + \operatorname{E}[X](1-\operatorname{E}[X]),$$ so $$\operatorname{Var}[X] = \frac{m(m-1)n(n-1)}{N(N-1)} + \frac{mn}{N}\left(1 - \frac{mn}{N}\right) = \frac{mn(N-m)(N-n)}{N^2 (N-1)},$$ for example. What is nice about the above derivation is that the formula for the expectation of $\binom{X}{k}$ is very simple to remember.
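The formula $\operatorname{E}\left[\binom{X}{k}\right] = \binom{m}{k}\binom{n}{k}/\binom{N}{k}$ is easy to verify numerically. Here is a short Python sketch (with arbitrarily chosen $N=15$, $m=6$, $n=4$) that checks it for several $k$ by direct summation over the p.m.f.:

```python
from fractions import Fraction
from math import comb

def pmf(x, N, m, n):
    # Hypergeometric p.m.f.; math.comb returns 0 outside the support.
    return Fraction(comb(m, x) * comb(N - m, n - x), comb(N, n))

def expected_binom_x_k(N, m, n, k):
    # E[C(X, k)] computed by direct summation over the support.
    return sum(Fraction(comb(x, k)) * pmf(x, N, m, n)
               for x in range(min(m, n) + 1))

N, m, n = 15, 6, 4  # arbitrary example parameters
for k in range(5):
    assert expected_binom_x_k(N, m, n, k) == \
        Fraction(comb(m, k) * comb(n, k), comb(N, k))
```

Note that `math.comb(n, k)` already returns $0$ when $k > n$, which matches the zero-outside-the-support convention used in the derivation.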


The trials are not independent, but they are identically distributed and, indeed, exchangeable, so the covariance between two of them doesn't depend on which two they are. The expected number of black balls on any one trial is $a/N$, so just add that up $n$ times.

The variance for one trial is $pq=p(1-p) = \dfrac a N\cdot\left(1 - \dfrac a N\right)$, but you also need the covariance between two trials. The probability of getting a black ball on both of the first two trials is $\dfrac{a(a-1)}{N(N-1)}$. So the covariance is \begin{align} \operatorname{cov}(X_1,X_2) & = \operatorname{E}(X_1 X_2) - (\operatorname{E}X_1)(\operatorname{E}X_2) \\[10pt] & = \Pr(X_1=X_2=1) - (\Pr(X_1=1))^2 \\[10pt] & = \frac{a(a-1)}{N(N-1)} -\left( \frac a N \right)^2. \end{align}

Add up $n$ variances and $n(n-1)$ covariances to get the variance: $$ \operatorname{var}(X_1+\cdots+X_n) = \sum_i \operatorname{var}(X_i) + \sum_{i,j\,:\,i\ne j}\operatorname{cov}(X_i,X_j). $$

(You'll need to do a bit of routine algebraic simplification.)
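The "add up $n$ variances and $n(n-1)$ covariances" recipe can be carried out numerically, which also confirms the routine simplification. A small Python sketch (arbitrary parameters $N=20$, $a=7$, $n=5$) that builds the variance from the indicators and checks it against the closed form $npq\,(N-n)/(N-1)$:

```python
from fractions import Fraction

N, a, n = 20, 7, 5  # arbitrary example parameters
p = Fraction(a, N)

# Variance of a single indicator X_i ("draw i is black"): pq.
var_one = p * (1 - p)

# Covariance between two distinct draws, from Pr(X_1 = X_2 = 1).
cov = Fraction(a * (a - 1), N * (N - 1)) - p * p

# n identical variances plus n(n-1) identical covariances.
var_total = n * var_one + n * (n - 1) * cov

# Closed form: npq (N-n)/(N-1).
assert var_total == n * p * (1 - p) * Fraction(N - n, N - 1)
```

The covariance comes out negative, as expected: drawing a black ball on one trial makes black slightly less likely on another.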


There is a way to compute the variance of the hypergeometric without too many calculations, by going through $\mathbb E[\binom X2]$ first. (This is building on the logic of heropup's answer, but avoids working with summations.)

If $X$ is the number of black balls drawn, then $\binom X2$ counts the number of pairs of black balls drawn. There is another way to think about this number:

  • Among the $n$ balls drawn, there are $\binom n2$ pairs.
  • The probability that both balls in a pair are black is $\binom a2 / \binom N2$.

By linearity of expectation, whenever we have a collection of equally likely events, the expected number of them that occur is the number of events times the probability that any single event occurs. (No independence is required.) Therefore $$ \mathbb E\left[\binom X2\right] = \binom n2 \cdot \frac{\binom a2}{\binom N2} $$ or (multiplying by $2$ and canceling constants) $$ \mathbb E[X(X-1)] = \frac{n(n-1) \cdot a(a-1)}{N(N-1)}. $$ From here, since $\text{Var}[X] = \mathbb E[X(X-1)] + \mathbb E[X] - \mathbb E[X]^2$, we get $$ \frac{n(n-1) \cdot a(a-1)}{N(N-1)} + \frac{n \cdot a}{N} - \frac{n^2 \cdot a^2}{N^2} $$ for the variance. The rest is simplification.
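A quick exact check of the pair-counting formula and the resulting variance, in Python (parameters $N=18$, $a=8$, $n=6$ chosen arbitrarily):

```python
from fractions import Fraction
from math import comb

N, a, n = 18, 8, 6  # arbitrary example parameters

def pmf(x):
    # Hypergeometric p.m.f. for x black balls among n draws.
    return Fraction(comb(a, x) * comb(N - a, n - x), comb(N, n))

# E[X(X-1)] computed directly from the p.m.f. ...
e_fact = sum(x * (x - 1) * pmf(x) for x in range(n + 1))
# ... agrees with the pair-counting formula n(n-1)a(a-1)/(N(N-1)).
assert e_fact == Fraction(n * (n - 1) * a * (a - 1), N * (N - 1))

# Var X = E[X(X-1)] + E[X] - E[X]^2, compared to the closed form.
mean = Fraction(n * a, N)
var = e_fact + mean - mean ** 2
assert var == Fraction(n * a * (N - a) * (N - n), N * N * (N - 1))
```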