Distribution of highest numbered occupied box and number of balls therein?

161 Views Asked by At

Say we have $N$ boxes, numbered $1,2,...,N$.

We select $K$ numbers uniformly from $1,2,...,N$, and for each result we place one ball into the correspondingly numbered box.

We then select $L$ numbers uniformly from the same $1,2,...,N$ and for each result we remove one ball (if any are left) from the corresponding box.

In both iterations, the numbers are selected with replacement. The boxes have no capacity limit, that is, more than one ball can be placed into any given box.

To clarify the mechanic, say we had 4 boxes, and we'd tossed some number $K$ balls into them as specified. Now, say $L$ was 4, and the resulting samples of numbers were ${1,3,4,4}$. We'd remove one ball from box 1 (if any were there), one ball from box 3 (if any...), and two balls from box 4 (if any were there, so if box 4 had 2 or fewer balls it will be left empty).

I'm interested in the probability distribution of the highest numbered occupied box (if any) and the associated number of remaining balls in that box.

I've written a couple of methods in my CAS to arrive at the result, one simply enumerating the multinomial possibilities and doing the corresponding machinations, the other using a generating function and pulling the coefficients with corresponding machinations.

As requested in comments, an example for the results for the case of $N=6$ boxes, $K=3$ tosses, and $L=2$ removals is as follows (left column is box number, followed left-to-right with probabilities that box is the highest numbered occupied box, with number of remaining balls $1,2,3$ seen there):

enter image description here

Both work fine up to $K$ and $L$ of 15 with $N$ up to ~6, but the complexity of the methods means it gets slow pretty quickly.

Is there a more efficient means to arrive at the desired results?

1

There are 1 best solutions below

1
On BEST ANSWER

For your unrefined probability with $n$ being the largest occupied box, you effectively want to count the number of sequences of pairs $$((A_1, B_1), \ldots, (A_N, B_N))$$ where

  1. $(A_1, \ldots, A_N)$ is an ordered set partition of $\{1, \ldots, K\}$,
  2. $(B_1, \ldots, B_N)$ is an ordered set partition of $\{1, \ldots, L\}$,
  3. $|A_n| > |B_n|$, and
  4. $|A_i| \leq |B_i|$ for $i > n$.

(These ordered set partitions allow empty blocks.)

Using standard theory of bivariate exponential generating functions, this count is $$\left[\frac{x^K}{K!} \frac{y^L}{L!}\right] e^{(n-1)(x+y)} G_>(x, y) G_{\leq}(x, y)^{N-n} \qquad (*)$$ where $$G_>(x, y) = \sum_{i>j\geq 0} \frac{x^i}{i!} \frac{y^j}{j!}$$ and $$G_{\leq}(x, y) = \sum_{0 \leq i \leq j} \frac{x^i}{i!} \frac{y^j}{j!}.$$ Divide (*) by $N^{K+L}$ to get the probability.

While $G_>$ and $G_{\leq}$ don't meaningfully simplify, we can truncate them with $i \leq K$ and $j \leq L$, toss them into a CAS to do the polynomial multiplications quickly, and then extract coefficients. If you want to do this for fixed $N, n$ and many $K, L$, I doubt you'll get anything faster. Of course, if you want asymptotics in certain regimes, torturing (*) would be a good place to start. For fixed $N, n$ you should be able to use singularity analysis to determine behavior for large $K, L$.

"Nicer" answers would likely imply simplifications to (*), which probably don't exist, so I wouldn't be hopeful for such things.

You can get the more refined probability with $n$ being the largest occupied box and $m$ being the number of balls in it by replacing (3) with $|A_n| = |B_n|+m$ and replacing (*) with $$\left[\frac{x^K}{K!} \frac{y^L}{L!}\right] e^{(n-1)(x+y)} G_m(x, y) G_{\leq}(x, y)^{N-n} \qquad (**)$$

where $$G_m(x, y) = \sum_{j \geq 0} \frac{x^{j+m}}{(j+m)!} \frac{y^j}{j!}.$$

I implemented this (not well) in Mathematica. It certainly slows down quickly, but I for instance did the $K=20, L=16, N=30, n=10, m=6$ case in about 10 seconds.

enter image description here