Distribution of number of non-zero counts of a multinomial distributed set

The multinomial count distribution for a set of $M$ categories is

$P((n_1, \dots,n_M)|N)=\left(\frac{N!}{n_1!\dots n_M!}\prod_{i=1}^Mp_i^{n_i}\right)\delta\left( \sum_{i=1}^M n_i-N\right)\;,$
where category $i=1, \dots, M$ appears $n_i$ times, and the Kronecker $\delta$ enforces the constraint $\sum_{i=1}^M n_i=N$.

What is the distribution, $P(m)$, of the number of observed categories, i.e. the number of categories with non-zero counts, $m=\sum_{i=1}^M\mathrm{sgn}[n_i]$, where $\mathrm{sgn}(x)=1$ for $x>0$ and $\mathrm{sgn}(0)=0$?

I couldn't find anything in a quick internet search, but I am still hoping there may be a closed-form solution, even for the case where the $p_i$ are all distinct.

One approach is to focus instead on the distribution, $P_0(m_0)$, of the number of zero counts, $m_0$. Since exactly $m_0=M-m$ categories are unobserved, $P(m)=P_0(M-m)$.

A specific subset of $m_0$ categories is denoted $\alpha=\{\alpha_1, \dots, \alpha_{m_0}\}$, where $\alpha_i\in\{1, \dots, M\}$. There are $\binom{M}{m_0}$ distinct $\alpha$ subsets. For a given $\alpha$, the probability to obtain one of the $\alpha$ categories in a single count sample is $p_\alpha=\sum_{i=1}^{m_0}p_{\alpha_i}$. Thus, the probability to obtain something other than the $\alpha$ categories across $N$ samples is $(1-p_\alpha)^N$. This representation ignores permutations of the $\alpha$ subset. There are $m_0!$ such permutations.

If someone knows a solution, please share. In lieu of a solution, I could use help with the combinatorics of summing $(1-p_\alpha)^N$ over the subsets $\alpha$ and their orderings.
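In the meantime, $P(m)$ can at least be estimated by simulation, to test any candidate formula against. A minimal sketch assuming NumPy (the probability vector and $N$ below are just an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.1, 0.1, 0.2, 0.2, 0.4])  # illustrative probabilities
N = 6                                    # number of samples per draw

# draw many multinomial count vectors and count the non-zero categories
draws = rng.multinomial(N, p, size=100_000)
m = (draws > 0).sum(axis=1)

# empirical estimate of P(m) for m = 0, ..., M
P_m = np.bincount(m, minlength=len(p) + 1) / len(m)
```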

There are 3 solutions below.


Here is an alternative solution based on generating functions.

We want to find the distribution $P(m|N)$ for the number $m$ of nonzero counts given a multinomial with parameters $N$ and a set $\lbrace p_k \rbrace_{k = 1}^M$ of probabilities. To do this, we will construct a generating function \begin{equation*} F(x,y) = \sum_{n = 0}^\infty \sum_{m = 0}^\infty f(n,m) x^n y^m \end{equation*} such that $f(N,m) = P(m|N)$. In that construction, $x$ tracks the sum of the counts, while $y$ tracks the number of nonzero counts.

A usual trick with generating functions is to try to find a product representation (for probability generating functions (PGFs), this is motivated by the fact that the PGF of a sum of independent variables is the product of their PGFs). Here, note that $F(x,y)$ is not exactly a PGF ($F(1,1) \neq 1$), but for all practical purposes it shares the same properties.

Even though the counts for the individual elements $k$ are not independent, let's consider the generating function \begin{align*} G_k(x,y) &= 1 + y(e^{p_k x} -1) \;, \\ &= 1 + y \left[p_kx + \frac{(p_kx)^2}{2!} + \dots \right] \end{align*} Note that $y$ multiplies all terms with a power of $x$ of at least one, meaning it tracks all nonzero counts. If we multiply all individual generating functions, we get \begin{align*} F(x,y) = \prod_{k=1}^M G_k(x,y) \;. \end{align*}

Now the beauty of this approach: if we expand $F(x,y)$ in powers of $x$ and $y$, then for a given pair $(n,m)$ we see that it sums over all possible combinations of counts such that the counts sum to $n$, $m$ of them are nonzero, and we have the correct multinomial factors for the powers of each $p_k$ and the factorials in the denominator. The only factor missing is the $N!$ in the numerator. To adjust for this, we simply replace $G_k(x,y)$ by \begin{align*} G_k(x,y) &= 1 + y(e^{a p_k x} -1) \;, \end{align*} where $a = (N!)^{1/N}$, so that the coefficient of $x^N$ picks up a factor $a^N = N!$. Again one can verify that the expansion now yields the correct terms for the multinomial distribution when $n = N$.

The last step is to invert this generating function using a fast Fourier transform (in 2 dimensions). While the whole thing seems quite complicated, the resulting code is short (see below for an example in Python with uniform probabilities), and I have verified that it matches the exact result by Tobias. And this is fast! It is $O(NM^2 + NM\log(N)\log(M))$ to compute the distribution.

import numpy as np
from itertools import product
from scipy.special import loggamma

p = np.ones(10)  # probabilities (uniform example)
p /= sum(p)
N = 10  # sum of counts

a = np.exp(loggamma(N+1)/N)  # a = (N!)^(1/N), computed in log space
G = lambda x,y: 1+y*(np.exp(a*p*x)-1)
F = lambda x,y: np.prod(G(x,y))

# evaluate F on a grid of roots of unity
L1 = N+1
L2 = len(p)+1
l1 = np.arange(L1)
l2 = np.arange(L2)
c1 = np.exp(2*np.pi*1j*l1/L1)
c2 = np.exp(2*np.pi*1j*l2/L2)

spectrum = np.zeros((L1,L2),dtype=complex)
for j,k in product(range(L1),range(L2)):
    spectrum[j,k] = F(c1[j],c2[k])

# invert with a 2D FFT; P(m|N) is the row f[N, :]
f = np.absolute(np.fft.fft2(spectrum)) / (L1*L2)

Some remarks:

  1. Technically there are some small numerical errors due to aliasing, but since the distribution $f(n,m)$ is quickly decreasing in $n$, for a sufficiently large $N$ there is no problem. Otherwise, one can always calculate the Fourier spectrum for a larger range (increasing $L_1$ in the code above; increasing $L_2$ won't help) and discard the extra values.
  2. I tested the code above for different kinds of distributions, and it was overall very stable. However, for $N \gtrsim 90$, numerical errors became an issue. For large $N$, an approximation based on the first few moments should be sufficient anyway.


I don't think this question has a closed-form solution in general. However, there is one special case, $p_i=p=\frac{1}{M}\;\forall i$, e.g., a fair six-faced die. In that case, we can use stars and bars. Assume $m \le \min(M, N)$.

Say we have $N$ balls to put into $M$ buckets, and every ball must be placed. There are $\binom{N + M - 1}{N}$ assignments in total; this is the denominator. Having exactly $m$ non-empty buckets can be viewed as a two-step process: first select $m$ buckets ($\binom{M}{m}$ possible selections), then arrange the $N$ balls in the selected buckets with none left empty ($\binom{N-1}{m-1}$ arrangements). For example, if $M=3, N=5$, there are the following cases:

  1. Only one bucket is non-empty. You can select any of the three buckets and put all balls in it. #possibilities = 3
  2. Exactly two buckets are non-empty. Selecting two buckets has 3 possibilities, and putting 5 balls in 2 buckets with no bucket empty has 4 possibilities. #possibilities = 12
  3. All three buckets are non-empty. Just arrange the 5 balls in the three buckets, again with no empty buckets. #possibilities = 6

The probabilities are then, e.g., $p(m=3) = 6 / \binom{5+3-1}{5} \approx 0.2857$. R code for this follows. Note that from now on, I am using $k$ instead of $M$.

multinomial_non0_count_v1 <- function(k, n) {
  # prints P(m non-empty buckets) for m = 1..min(k, n)
  stop = min(k, n)
  total = choose(n + k - 1, n)
  for (m in 1:stop) {
    prob = choose(k, m) * choose(n-1, m-1) / total
    print(paste(m, prob))
  }
}

The above function is not numerically stable for large $k,n,m$ because the binomial coefficients become too large to compute. However, we can calculate the probability by expanding the binomial coefficients and cancelling through division: $p = \frac{[k,\dots,m+1]\,[k-1,\dots,m]\,[n,\dots,n-m+1]}{[n+k-1,\dots,n]\,[k-m,\dots,1]}$. Abusing notation, $[x,\dots,y]$ here means $x(x-1)(x-2)\cdots y$, the product of the consecutive integers from $x$ down to $y$, where $x \ge y$.

prob <- function(m, k, n) {
  stopifnot(m <= min(k,n))
  if (m == k) {
    # special case: when m == k, seq(k, m+1) would count upwards
    res = exp(sum(log(seq(n, n-m+1))) - sum(log(seq(n+k-1, n))))
  } else {
    # sums of logs of the numerator and denominator factors above
    minuend = sum(log(seq(k, m+1))) + sum(log(seq(k-1, m))) + sum(log(seq(n, n-m + 1)))
    subtrahend = sum(log(seq(n+k-1, n))) + sum(log(seq(k-m, 1)))
    res = exp(minuend - subtrahend)
  }
  res
}
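The same cancellation can also be written with log-gamma functions. Here is a sketch in Python (the helper names are my own) that computes the same ratio $\binom{k}{m}\binom{n-1}{m-1}/\binom{n+k-1}{n}$ in log space:

```python
from math import lgamma, exp

def log_choose(a, b):
    # log of the binomial coefficient C(a, b), via log-gamma
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)

def prob_uniform(m, k, n):
    # P(exactly m non-empty buckets) = C(k,m) C(n-1,m-1) / C(n+k-1,n),
    # evaluated in log space to avoid overflow for large k, n
    return exp(log_choose(k, m) + log_choose(n - 1, m - 1)
               - log_choose(n + k - 1, n))
```

Because everything stays in log space until the final `exp`, this evaluates without overflow even for $k$ and $n$ in the thousands.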

I don't believe there is a general closed-form solution. You can find the exact solution using dynamic programming. You can approximate the distribution by computing the mean and variance and then using a normal approximation. And finally, for the case where $p_1 = p_2 = \ldots = p_M = 1/M$, you can find a closed-form solution using Stirling numbers of the second kind.

Exact solution using dynamic programming

Define $S_k = \sum_{i=1}^k n_i$, $I_k = \sum_{i=1}^k \mathbb{I}(n_i > 0)$. Using dynamic programming we will iteratively compute the probability that the $k$ first variables sum to $n$ and that $m$ of the first $k$ are non-zero,

$$ d(k, n, m) = P(S_k = n, I_k = m) $$

Initialize the dynamic programming table with,

$$ d(0, n, m) = \begin{cases} 1 & n=0, m=0 \\ 0 & \text{otherwise} \end{cases} $$

Then recurse over $k=1,\ldots,M$,

$$ d(k, n, m) = \operatorname{binomial}(0; N-n, \tilde{p}_k)d(k-1, n, m) + \sum_{r=1}^{n}\operatorname{binomial}(r; N-n+r,\tilde{p}_k)d(k-1, n-r, m-1), \quad m \leq k $$

where $\tilde{p}_{k} = p_{k} / \sum_{i=k}^M p_i$ is the probability of category $k$ renormalized over the remaining categories $k, \ldots, M$. The first term accounts for the case where the $k$-th variable is zero and therefore does not contribute new non-zero variables, and the second term (the sum) accounts for all cases where the $k$-th variable is non-zero. Below is a naive R implementation. The loops can be rearranged so that the final loop runs over $r$ rather than $m$, to avoid computing the same binomial densities multiple times; still, I don't think this can be done in better than $\mathcal{O}(M^2N^2)$.

p <- c(0.1,0.1,0.2,0.2,0.4)
N <- 6
M <- 5

p_tilde <- p / rev(cumsum(rev(p)))
d <- array(0, dim = c(M+1,N+1,M+1))
d[1,1,1] = 1
for (k in 1:M) {
  for (n in 0:N) {
    for (m in 0:k) {
      d[k+1, n+1, m+1] <- dbinom(0, N-n, prob = p_tilde[k]) * d[k, n+1, m+1]
      if (m > 0 && n > 0) {
        r <- 1:n
        d[k+1, n+1, m+1] <- d[k+1, n+1, m+1] + 
          sum(dbinom(r, r+N-n, prob = p_tilde[k]) * d[k, n-r+1, m])
      }
    }
  }
}
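For cross-checking, here is a direct Python port of the same recursion (a sketch; the function names are mine, and the binomial density is written out to keep it dependency-free):

```python
import numpy as np
from math import comb

def binom_pmf(r, size, prob):
    # binomial density P(X = r) for X ~ Binomial(size, prob)
    return comb(size, r) * prob**r * (1 - prob)**(size - r)

def nonzero_count_dist(p, N):
    # d[k, n, m] = P(first k counts sum to n and m of them are non-zero)
    M = len(p)
    tail = np.cumsum(p[::-1])[::-1]        # tail[k-1] = p_k + ... + p_M
    p_tilde = np.asarray(p) / tail
    d = np.zeros((M + 1, N + 1, M + 1))
    d[0, 0, 0] = 1.0
    for k in range(1, M + 1):
        pt = p_tilde[k - 1]
        for n in range(N + 1):
            for m in range(k + 1):
                val = binom_pmf(0, N - n, pt) * d[k - 1, n, m]
                if m > 0:
                    for r in range(1, n + 1):
                        val += binom_pmf(r, N - n + r, pt) * d[k - 1, n - r, m - 1]
                d[k, n, m] = val
    return d[M, N, :]                      # P(m | N) for m = 0, ..., M
```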

Verify solution by simulation

> table(colSums(rmultinom(1e6, size = N, prob = p) > 0 )) / 1e6

       1        2        3        4        5 
0.004245 0.114157 0.449484 0.374784 0.057330 
> d[M+1,N+1,]
[1] 0.000000 0.004226 0.114734 0.449280 0.374160 0.057600

Approximate solution

To use a normal approximation we must compute the mean and variance. First the mean can be computed as,

$$ \mathbb{E}[m]= \mathbb{E}\left[\sum_{i=1}^M\mathbb{I}(n_i>0)\right] = \sum_{i=1}^M\mathbb{E}[\mathbb{I}(n_i>0)] = \sum_{i=1}^M \left[1-(1-p_i)^N\right]. $$

Next the variance, \begin{equation} \begin{split} \mathbb{V}[m] &= \mathbb{V}\left[\sum_{i=1}^M\mathbb{I}(n_i>0)\right] \\ &= \sum_{i=1}^M\mathbb{V}[\mathbb{I}(n_i>0)] + 2 \sum_{1\leq i < j \leq M} \operatorname{Cov}(\mathbb{I}(n_i>0), \mathbb{I}(n_j>0)) \\ &= \sum_{i=1}^M (1-(1-p_i)^N)(1-p_i)^N + 2 \sum_{1\leq i < j \leq M} \operatorname{Cov}(1-\mathbb{I}(n_i>0), 1-\mathbb{I}(n_j>0)) \\ &= \sum_{i=1}^M (1-(1-p_i)^N)(1-p_i)^N + 2 \sum_{1\leq i < j \leq M} \left[(1-p_i-p_j)^N - (1-p_i)^N(1-p_j)^N\right] \end{split} \end{equation}

This is an $\mathcal{O}(M^2)$ computation. If $M$ is very large, you may omit the covariance terms, which you should generally expect to be small, and compute the approximation in $\mathcal{O}(M)$.

computed_variance <- sum((1-p)^N*(1-(1-p)^N))
for (i in 1:(M-1)) {
  for (j in (i+1):M) {
    computed_variance <- computed_variance + 2*((1-p[i]-p[j])^N-(1-p[i])^N*(1-p[j])^N)
  }
}

> computed_variance
[1] 0.602115
> var(colSums(rmultinom(1e6, size = N, prob = p) > 0 ))
[1] 0.602461
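Putting the two moments together, here is a Python sketch of the normal approximation (using the same example `p` and `N` as in the R code above):

```python
import numpy as np

p = np.array([0.1, 0.1, 0.2, 0.2, 0.4])   # same example as above
N = 6
M = len(p)

q = (1 - p) ** N                 # q_i = P(n_i = 0)
mean = np.sum(1 - q)
var = np.sum(q * (1 - q))
for i in range(M):               # add the pairwise covariance terms
    for j in range(i + 1, M):
        var += 2 * ((1 - p[i] - p[j]) ** N - q[i] * q[j])

# evaluate the normal density at the integers as an approximation to P(m)
m_vals = np.arange(M + 1)
approx = np.exp(-(m_vals - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
```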

Uniform case

Finally, in the uniform case $p_i=1/M\;\forall i$, we must count how many ways we can distribute $N$ numbered objects over $M$ bins using exactly $m$ of the bins. This can be done in a two-step procedure: first choose $m$ of the $M$ bins, which can be done in ${M \choose m}$ ways; then distribute the $N$ objects into the chosen bins with at least one object in each bin, which can be done in $m!\,S(N,m)$ ways, where $S(N,m)$ is the Stirling number of the second kind. We end up with the following formula

$$ p(m) = {M \choose m} m! S(N,m) / M^N $$

Example, $N=4$, $M=5$, $m=3$ ($S(4,3) = 6$):

> table(colSums(rmultinom(1e6, size = 4, prob = rep(1/5, 5)) > 0)) / 1e6

       1        2        3        4 
0.008013 0.224771 0.575393 0.191823 
> choose(5, 3) * factorial(3) * 6 / 5^4
[1] 0.576
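The uniform-case formula is also easy to evaluate in Python. A sketch (the helper names are mine) using the standard recurrence $S(n,k) = k\,S(n-1,k) + S(n-1,k-1)$:

```python
from math import comb, factorial

def stirling2(n, k):
    # Stirling numbers of the second kind via the standard recurrence
    S = [[0] * (k + 1) for _ in range(n + 1)]
    S[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            S[i][j] = j * S[i - 1][j] + S[i - 1][j - 1]
    return S[n][k]

def p_uniform(m, M, N):
    # P(exactly m categories observed) for uniform p_i = 1/M:
    # C(M,m) * m! * S(N,m) / M^N
    return comb(M, m) * factorial(m) * stirling2(N, m) / M**N
```

For the example above ($N=4$, $M=5$, $m=3$) this reproduces the exact value 0.576.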