Moment Generating Function (MGF) of Hypergeometric Distribution is No Greater Than MGF of Binomial Distribution with the Same Mean


The Setup

Consider a hypergeometric distribution $X$ with parameters $N, n, m,$ i.e. $$\mathbb{P}[X = k] = \frac{{m \choose k} {N-m \choose n-k}}{{N \choose n}},$$ for $k$ running from $0$ to $\min\{n,m\}.$ The mean $\mathbb{E}[X]$ is $\frac{mn}{N}.$ Let $Y$ be a $\mathsf{Bi}(m,n/N)$ distribution, i.e. $$\mathbb{P}[Y = k] = {m \choose k}\left(\frac{n}{N}\right)^k \left(1-\frac{n}{N} \right)^{m-k}.$$ $Y$ also has mean $\frac{mn}{N}.$
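As a quick sanity check of the setup, here is a small Python sketch (the parameter values $N=20$, $n=8$, $m=5$ are arbitrary) that builds both pmfs directly from the formulas above and confirms the two means agree with $mn/N$:

```python
from math import comb

# Arbitrary small example parameters (not from the text)
N, n, m = 20, 8, 5

# Hypergeometric pmf: P[X = k] = C(m,k) C(N-m, n-k) / C(N, n)
def hyper_pmf(k):
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)

# Binomial Bi(m, n/N) pmf: P[Y = k] = C(m,k) (n/N)^k (1 - n/N)^(m-k)
def binom_pmf(k):
    p = n / N
    return comb(m, k) * p**k * (1 - p) ** (m - k)

mean_X = sum(k * hyper_pmf(k) for k in range(min(n, m) + 1))
mean_Y = sum(k * binom_pmf(k) for k in range(m + 1))

print(mean_X, mean_Y, m * n / N)  # all three should agree
```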

(In $X,$ the roles of $m$ and $n$ are interchangeable, so the following should hold for $\mathsf{Bi}(n,m/N)$ also.) In Janson, Łuczak, and Ruciński's Random Graphs (JLR), the proof of Theorem 2.10 (page 29 in my edition) is largely left as an exercise. It depends on the following fact:

The Question

$$(1) \ \ \forall u \in \mathbb{R}, \mathbb{E}[e^{uX}] \leq \mathbb{E}[e^{uY}].$$ That is, the MGF of $X$ is less than or equal to the MGF of $Y.$ Intuitively, this makes sense, since $Y$ seems more likely than $X$ to be far from its mean. How can we prove this?

What I've Tried

JLR references Hoeffding 1963 for this result. I went through the parts of that paper that seemed most relevant to this problem ("Sampling from a Finite Population" and some of "Sums of Dependent Random Variables"), but I was unable to find anything that proves $$\forall u \in \mathbb{R}, \mathbb{E}[e^{uX}] \leq \mathbb{E}[e^{uY}].$$ (I skimmed the rest of the paper as well.) One more remark about Hoeffding: when that paper discusses $\mathbb{E}[e^{cZ}]$ it seems to usually require $c>0$, whereas our desired result holds for all $u \in \mathbb{R}.$

I have tried writing out explicit expressions for these MGFs: $$\mathbb{E}[e^{uX}] = \sum_{k} \frac{{m \choose k} {N-m \choose n-k}}{{N \choose n}} e^{uk}$$ and $$\mathbb{E}[e^{uY}] = \sum_{k} {m \choose k} \left(\frac{n}{N} \right)^k \left(1-\frac{n}{N}\right)^{m-k} e^{uk}.$$ These summands share factors of ${m \choose k}$, and $e^{uk} > 0,$ so one might hope to prove $$(2) \ \ \forall\, 0 \leq k \leq \min\{m,n\},\ m,n \leq N, \ \ \frac{{N - m \choose n-k}}{{N \choose n}} \leq \left(\frac{n}{N} \right)^k \left(1 - \frac{n}{N} \right)^{m-k}.$$ If true, $(2)$ would prove $(1), \mathbb{E}[e^{uX}] \leq \mathbb{E}[e^{uY}].$ I currently do not believe this inequality is true, based on some numbers I put into Maple (though I may have made an error there). In any case, although $(2)$ would be sufficient, it is not necessary for proving $(1)$.
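Inequality $(2)$ can be probed by brute force. The sketch below (the search ranges are arbitrary) redoes the Maple experiment in Python and collects parameter tuples $(N,n,m,k)$ where $(2)$ fails; it does find counterexamples, e.g. $N=4$, $n=m=2$, $k=1$, where the left side is $1/3$ and the right side is $1/4$:

```python
from math import comb

# Brute-force search for counterexamples to inequality (2):
#   C(N-m, n-k) / C(N, n)  <=  (n/N)^k (1 - n/N)^(m-k)
# over small parameter values (range limits are arbitrary).
counterexamples = []
for N in range(2, 12):
    for n in range(1, N + 1):
        for m in range(1, N + 1):
            for k in range(0, min(n, m) + 1):
                if n - k > N - m:
                    continue  # C(N-m, n-k) = 0, so (2) holds trivially
                lhs = comb(N - m, n - k) / comb(N, n)
                rhs = (n / N) ** k * (1 - n / N) ** (m - k)
                if lhs > rhs + 1e-12:
                    counterexamples.append((N, n, m, k))

print(len(counterexamples), counterexamples[:3])
```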

I also tried expanding $e^{uX} = 1 + uX + \frac{u^2 X^2}{2} + \cdots,$ and similarly for $e^{uY}.$ For $u \geq 0,$ (probably using Fubini-Tonelli somewhere), it would suffice to prove $$(3) \ \ \forall K \in \mathbb{N}, \ \ \mathbb{E}[X^K] \leq \mathbb{E}[Y^K].$$ I do not know how to do this (although it seems true), and in any case it wouldn't directly prove (1) when $u < 0.$
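The moment comparison $(3)$ can likewise be checked numerically for small $K$. A minimal sketch, using one arbitrary parameter choice and $K \le 10$:

```python
from math import comb

# Numerically check (3): E[X^K] <= E[Y^K] for the first several K,
# for one (arbitrary) parameter choice N, n, m.
N, n, m = 15, 6, 7
p = n / N

hyper = [comb(m, k) * comb(N - m, n - k) / comb(N, n) for k in range(min(n, m) + 1)]
binom = [comb(m, k) * p**k * (1 - p) ** (m - k) for k in range(m + 1)]

ok = all(
    sum(pk * k**K for k, pk in enumerate(hyper))
    <= sum(pk * k**K for k, pk in enumerate(binom)) + 1e-12
    for K in range(1, 11)
)
print(ok)
```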


Answer 1 (Best Answer)

Let $r^{\underline k}$ denote the falling power $r(r-1)(r-2)\dotsm (r-k+1)$.

The hypergeometric $X$ denotes the number of successes among $m$ draws without replacement from a pool with $N$ options, $n$ of which are successful. So $X^{\underline k}$ denotes the number of $k$-tuples of distinct successes among the $m$ draws. By linearity of expectation, $$ \mathbb E[X^{\underline k}] = m^{\underline k} \cdot \frac{n^{\underline k}}{N^{\underline k}} $$ as there are $m^{\underline k}$ $k$-tuples of distinct draws, and a probability of $\frac{n^{\underline k}}{N^{\underline k}}$ that all $k$ draws in a $k$-tuple are successful.

By similar reasoning, we have $$ \mathbb E[Y^{\underline k}] = m^{\underline k} \frac{n^k}{N^k} $$ since here, the probability that all $k$ draws in a $k$-tuple are successful is just $(\frac nN)^k$.

Comparing these: since $n \le N$, each factor satisfies $\frac{n-j}{N-j} \le \frac nN$ for $0 \le j \le k-1$, so $\frac{n^{\underline k}}{N^{\underline k}} \le \frac{n^k}{N^k}$, and therefore $\mathbb E[X^{\underline k}] \le \mathbb E[Y^{\underline k}]$.
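The factorial-moment formula above can be verified directly against the hypergeometric pmf; a minimal Python check (the parameter values are arbitrary):

```python
from math import comb

def falling(r, k):
    """Falling power r(r-1)...(r-k+1); equals 0 when 0 <= r < k (integer r)."""
    out = 1
    for j in range(k):
        out *= r - j
    return out

# Check E[X^(k falling)] = m^(k falling) n^(k falling) / N^(k falling)
# against the hypergeometric pmf, for arbitrary small parameters.
N, n, m = 12, 5, 7
pmf = [comb(m, k) * comb(N - m, n - k) / comb(N, n) for k in range(min(n, m) + 1)]

errs = []
for k in range(1, 6):
    lhs = sum(p * falling(j, k) for j, p in enumerate(pmf))
    rhs = falling(m, k) * falling(n, k) / falling(N, k)
    errs.append(abs(lhs - rhs))
print(max(errs))
```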

The $k^{\text{th}}$ derivative of $\mathbb E[z^X]$ with respect to $z$ is $\mathbb E[X^{\underline k} z^{X-k}]$, so in particular the $k^{\text{th}}$ derivative at $z=1$ is just $\mathbb E[X^{\underline k}]$. Therefore, comparing the Taylor expansions of $\mathbb E[z^X]$ and $\mathbb E[z^Y]$ around $z=1$, each coefficient of the former is at most the corresponding coefficient of the latter.

We conclude that $\mathbb E[z^X] \le \mathbb E[z^Y]$ for $z \ge 1$; in other words, $\mathbb E[e^{uX}] \le \mathbb E[e^{uY}]$ for $u \ge 0$.

For $u\le 0$, let $v=-u$, and note that $m-X$ and $m-Y$ are hypergeometric and binomial, respectively, with the same relationship as $X$ and $Y$. So we have $\mathbb E[e^{v(m-X)}] \le \mathbb E[e^{v(m-Y)}]$; dividing by the constant $e^{vm}$, we get $\mathbb E[e^{-vX}] \le \mathbb E[e^{-vY}]$.
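Putting both halves together, the claimed inequality $(1)$ can be checked on a grid of $u$ values of both signs; the sketch below uses one arbitrary parameter choice:

```python
from math import comb, exp

# Check (1): E[e^{uX}] <= E[e^{uY}] on a grid of u values, both signs,
# for one (arbitrary) parameter choice.
N, n, m = 18, 7, 9
p = n / N
hyper = [comb(m, k) * comb(N - m, n - k) / comb(N, n) for k in range(min(n, m) + 1)]
binom = [comb(m, k) * p**k * (1 - p) ** (m - k) for k in range(m + 1)]

def mgf(pmf, u):
    return sum(pk * exp(u * k) for k, pk in enumerate(pmf))

violations = [u / 10 for u in range(-30, 31)
              if mgf(hyper, u / 10) > mgf(binom, u / 10) * (1 + 1e-12)]
print(violations)
```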

Answer 2

Let $N \in \mathbb{Z}_{>0}.$ Notation: $[N] = \{1,2,...,N\}.$ Let $C = \{c_1, ..., c_N\}$ be a set of variables, and $\mathbb{R}^C$ the vector space of $\mathbb{R}$-linear combinations of $c_j$s.

We introduce the following notation: for $1 \leq n \leq N,$

$$C^{\underline{n}} = \{(x_1, ..., x_n) \in C^n : \forall i, j \in [n], i \neq j \implies x_i \neq x_j \}.$$

That is, $C^{\underline{n}}$ is the set of ordered $n$-tuples from $C$ with all distinct elements. (For $N \in \mathbb{R},$ we use the different definition $N^{\underline{n}} := N (N-1) \cdots (N-n+1)$; take caution not to confuse the two usages. $\# [N]^{\underline{n}} = N^{\underline{n}}$.)

Next we define $C_{\oplus n}$: $$\mathbb{R}^C \supset C_{\oplus n} := \{x_1 + \cdots + x_n : \text{ each } x_i \in C\}.$$ Let $f : C_{\oplus n} \overset{\sim}{\to} \Gamma$ be a bijection from $C_{\oplus n}$ to another set of variables, $\Gamma.$

For $\gamma \in \Gamma$ and $v \in \mathbb{R}^\Gamma,$ $\langle \gamma, v \rangle$ is the coefficient of $\gamma$ in the standard-basis expansion of $v.$ Consider the following vectors in $\mathbb{R}^\Gamma$:

$$w := N^{-n} \sum_{x \in C^n} f(x_1 + \cdots + x_n)$$

and

$$u := \frac{1}{N^{\underline{n}}} \sum_{x \in C^{\underline{n}}} \left( \sum_{k=1}^n \sum_{\substack{r \in [n]^k, \\ r_1 + \cdots + r_k = n}} \sum_{i \in [n]^k} f(r_1 x_{i_1} + \cdots + r_k x_{i_k})\right).$$

Lemma 1. Let $k \in [n], i,i' \in [n]^{k},$ and $r \in [n]^k$ such that $r_1 + \cdots + r_k = n.$ Let $y_1 = (y_{1,1},...,y_{1,n})$ and $y_2 = (y_{2,1},...,y_{2,n}) \in C^{\underline{n}}.$ Define:

$$\gamma_1 := f(r_1 y_{1, i_1} + \cdots + r_k y_{1, i_k})$$

and

$$\gamma_2 := f(r_1 y_{2, i'_1} + \cdots + r_k y_{2, i'_k}).$$ Then $\langle \gamma_1, w \rangle = \langle \gamma_2, w \rangle > 0$ and $\langle \gamma_1, u \rangle = \langle \gamma_2, u \rangle > 0.$

Proof of Lemma 1. $\langle \gamma_1, w \rangle > 0$ and $\langle \gamma_1, u \rangle > 0$ because the term $\gamma_1 = f(r_1 y_{1,i_1} + \cdots + r_k y_{1,i_k})$ appears in both sums (and all coefficients are positive). Both $w$ and $u$ are invariant under permutations of $C$. That is, for any bijection $\sigma : C \to C,$ with $$w_{\sigma} := N^{-n} \sum_{x \in C^n} f(\sigma(x_1) + \cdots + \sigma(x_n))$$ and $$u_{\sigma} := \frac{1}{N^{\underline{n}}} \sum_{x \in C^{\underline{n}}} \left( \sum_{k=1}^n \sum_{\substack{r \in [n]^k, \\ r_1 + \cdots + r_k = n}} \sum_{i \in [n]^k} f(r_1 \sigma(x_{i_1}) + \cdots + r_k \sigma(x_{i_k}))\right),$$ we have $w_\sigma = w$ and $u_\sigma = u.$ Let $\sigma : C \to C$ be a bijection sending $y_{1,i_j}$ to $y_{2,i'_j}$ for $j = 1,2,...,k.$ Using the Iverson bracket notation,

$$\langle \gamma_1, w \rangle = N^{-n} \sum_{x \in C^n} [x_1 + \cdots + x_n =f^{-1}(\gamma_1)] \\ =N^{-n} \sum_{x \in C^n} [\sigma(x_1) + \cdots + \sigma(x_n) = f^{-1}(\gamma_2)] \\ =\langle \gamma_2, w_\sigma \rangle \\ =\langle \gamma_2, w \rangle.$$

Similarly, with $\sum_{(k,r,i)'}$ referring to the sum $\sum_{k=1}^n \sum_{\substack{r \in [n]^k, \\ r_1 + \cdots + r_k = n}} \sum_{i \in [n]^k},$

$$\langle \gamma_1, u \rangle = \frac{1}{N^{\underline{n}}} \sum_{x \in C^{\underline{n}}} \left( \sum_{(k,r,i)'} [r_1 x_{i_1} + \cdots + r_k x_{i_k} = f^{-1}(\gamma_1)] \right) \\ = \frac{1}{N^{\underline{n}}} \sum_{x \in C^{\underline{n}}} \left( \sum_{(k,r,i)'} [r_1 \sigma(x_{i_1}) + \cdots + r_k \sigma(x_{i_k}) = f^{-1}(\gamma_2)] \right) \\ = \langle \gamma_2, u_\sigma \rangle \\ = \langle \gamma_2, u \rangle.\ \square$$

Given the above lemma, for any composition $r$ of $n$ (that is, $r \in [n]^k$ and $r_1+ \cdots + r_k = n$), we can define:

$$c(r) = \langle \gamma_1, w \rangle$$

and

$$s(r) = \langle \gamma_1, u \rangle,$$

for any $\gamma_1 = f(r_1 y_{i_1} + \cdots + r_k y_{i_k})$ as in the statement of Lemma 1. Lemma 1 implies $c$ and $s$ are well-defined.

Consider the function $$f^* : C^{\underline{n}} \to \mathbb{R}^\Gamma \ \ \ \ (\dagger)$$ given by $$(x_1, ..., x_n) \overset{f^*}{\mapsto} \sum_{(k,r,i)'} \frac{c(r)}{s(r)} f(r_1 x_{i_1} + \cdots + r_k x_{i_k}).$$

Then (by comparing coefficients on either side) $$w = \frac{1}{N^{\underline{n}}} \sum_{x \in C^{\underline{n}}} f^*(x_1, ..., x_n).$$

Now consider $(X_1, X_2, ..., X_n)$ a random sample from $C$ taken (uniformly and) without replacement. That is, $\forall l \in \{0,1,...,n-1\}, X_{l+1} \in C \setminus \{X_1, ..., X_l \}$ is chosen uniformly at random. Let $\mathcal{X}:=\sum_{i=1}^n X_i.$ If we substitute in the values $c_1 = c_2 = \cdots = c_m = 1$ and $c_{m+1} = \cdots = c_N = 0,$ then $\mathcal{X}$ becomes a hypergeometric random variable with parameters $N, n, m.$ Let $(Y_1, ..., Y_n)$ be a random sample from $C$ taken uniformly and independently, with replacement. Let $\mathcal{Y}:=\sum_{i=1}^n Y_i.$ If we substitute in the values $c_1 = \cdots = c_m = 1$ and $c_{m+1} = \cdots = c_N = 0,$ then $\mathcal{Y}$ becomes a binomial $\mathsf{Bi}(m,n/N)$ distribution.

Theorem 2. [Hoeffding 1963, Theorem 4] Let $f : \mathbb{R}^C \to \mathbb{R}$ be convex. Then $$\mathbb{E}[f(\mathcal{X})] \leq \mathbb{E}[f(\mathcal{Y})].$$

Corollary 3. The MGF $\mathbb{E}[e^{uX}]$ of a hypergeometric distribution $X$ with parameters $N,n,m$ is less than or equal to the MGF $\mathbb{E}[e^{uY}]$ of a binomial distribution $Y$ with the same mean.

Proof of Corollary 3. Apply Theorem 2 to the real-valued, convex function $f:\mathbb{R}^C \to \mathbb{R}$ given by first substituting $c_1 = \cdots = c_m = 1$ and $c_{m+1} = \cdots = c_N = 0$ (a linear map $\mathbb{R}^C \to \mathbb{R}$), then exponentiating $z \mapsto e^{uz}$ (a convex function). $\square$

Proof of Theorem 2. We have: $$\mathbb{E}[f(\mathcal{X})] = \frac{1}{N^{\underline{n}}} \sum_{x \in C^{\underline{n}}} f(x_1 + \cdots + x_n)$$ and $$\mathbb{E}[f(\mathcal{Y})] = N^{-n} \sum_{y \in C^n} f(y_1 + \cdots + y_n).$$ With $c(r), s(r),$ and $f^*$ defined as above (see ($\dagger$)),

$$\mathbb{E}[f(\mathcal{Y})] = \frac{1}{N^{\underline{n}}} \sum_{x \in C^{\underline{n}}} f^*(x_1,..., x_n).$$

Setting $f$ to be a nonzero constant function, we get $$\sum_{(k,r,i)'} \frac{c(r)}{s(r)} = 1.$$

On the other hand, setting $f(x_1 + \cdots + x_n) = x_1 + \cdots + x_n,$ we find that $$f^*(x_1, ..., x_n) = \sum_{(k,r,i)'} \frac{c(r)}{s(r)}(r_1 x_{i_1} + \cdots + r_k x_{i_k})$$ extends to a linear map that is symmetric in the variables $x_1, ..., x_n$; all such functions take the form $K(x_1+ \cdots + x_n).$ Still using $f(x_1 + \cdots + x_n) = x_1 + \cdots + x_n$, since $$\mathbb{E}[f(\mathcal{Y})] = \mathbb{E}_{z \in C^{\underline{n}}}[f^*(z)],$$ we find that $K = 1.$

We now move back to general $f$ (real-valued, convex). Summarizing what we have found so far, $$\sum_{(k,r,i)'} \frac{c(r)}{s(r)} = 1$$ and $$\sum_{(k,r,i)'} \frac{c(r)}{s(r)} (r_1 x_{i_1} + \cdots + r_k x_{i_k}) = x_1 + \cdots + x_n.$$ Therefore, applying Jensen's inequality,

$$f(x_1 + \cdots + x_n) \leq \sum_{(k,r,i)'} \frac{c(r)}{s(r)} f(r_1 x_{i_1} + \cdots + r_k x_{i_k}) \\ = f^*(x_1, ..., x_n).$$

Taking expectations on both sides,

$$\mathbb{E}_{x \in C^{\underline{n}}}[f(x_1 + \cdots + x_n)] \leq \mathbb{E}_{x \in C^{\underline{n}}}[f^*(x_1, ..., x_n)],$$ i.e. $$\mathbb{E}[f(\mathcal{X})] \leq \mathbb{E}[f(\mathcal{Y})].\ \square$$
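Theorem 2 can be sanity-checked by exhaustive enumeration for a small pool: assign arbitrary real values to the variables in $C$, pick a convex $f$ (here $f(t)=t^2$), and average $f$ over all ordered samples without and with replacement:

```python
from itertools import permutations, product

# Exhaustive check of Theorem 2 (Hoeffding's Theorem 4) for a small,
# arbitrary pool of real values and the convex function f(t) = t^2.
C = [0.3, 1.0, 2.5, 4.0, 0.0]
n = 3

def f(t):
    return t * t

# Ordered n-tuples: distinct entries (without replacement) vs all (with replacement)
without = [sum(x) for x in permutations(C, n)]
with_repl = [sum(x) for x in product(C, repeat=n)]

EX = sum(f(s) for s in without) / len(without)       # E[f(sum, no replacement)]
EY = sum(f(s) for s in with_repl) / len(with_repl)   # E[f(sum, with replacement)]
print(EX <= EY + 1e-12, EX, EY)
```

The means of the two sums are equal, so with $f(t)=t^2$ the comparison reduces to the familiar fact that sampling without replacement has a smaller variance.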

Remark 4. Hoeffding's Theorem 4 also requires $f$ to be continuous, but this doesn't come up in the proof, so I omit it. (I've heard that every convex function from $\mathbb{R}^n \to \mathbb{R}$ is continuous, but I haven't seen the proof.)

Remark 5. Thanks to Larry Frolov, Mihail Tarigradschi, Robert Dougherty-Bliss, and anyone else who worked on this problem with me. Special thanks to Jinyoung Park, who pointed out that Hoeffding's Theorem 4 was what I needed.