Probability of having at least $j$ collisions when tossing $m$ balls into $n$ bins

705 Views Asked by At

Suppose that we throw $m$ balls into $n$ bins uniformly and independantly at random. We consider collisions as distinct unordered pairs, e.g., if 3 balls are tossed in one bin, we count 3 collisions.

What is the probability of having at least $j$ collisions when we throw $m$ balls into $n$ bins, such that at least $j$ bins are empty at the end of the process?

##Case $m \leq n$: As we suppose $m < n$, this question is in fact equivalent to computing the probability that we toss $m$ balls in $m-j$ bins. Indeed, we throw our $m-j$ first balls in distinct bins, and the $j$ remaining balls in these $m-j$ bins. We can view tossing $m$ balls in $m-j$ as the following equation: $$x_1 + x_2 + \cdots + x_{m-j} = m$$ where $x_i$ is the number of balls in bin $i$. The number of solutions in $\mathbb{N}$ of this equation is a $m$-combination on a set of size $m-j$, i.e., $\binom{m+m-j-1}{m} = \binom{2m - j - 1}{m}$.

The number of ways to choose $m-j$ bins out of $n$ is $\binom{n}{m-j}$.

The number of possibilities to throw $m$ balls in $n$ bins is $n^m$, thus $$\mathbb{P}[\text{at least } j \text{ collisions}] = \frac{\binom{2m - j - 1}{m} \cdot \binom{n}{m-j}}{n^m}.$$

What do you think of this approach? Am I overcounting/undercounting something?

##Case $m > n$: In my opinion, this case is much harder to solve.

As I count collisions as distinct unordered pairs, the total number of collisions is $$\sum_{i = 1}^n \binom{x_i}{2}$$ where $x_i$ is the number of balls in bin $i$. Hence, to have exactly $j$ collisions, a necessary condition is to have $\sum_{i=1}^{n} \binom{x_i}{2} = j$. Let's tweak this equation a bit. $$ \begin{align*} \sum_{i=1}^{n} \binom{x_i}{2} = j & \iff \sum_{i=1}^n \frac{x_i!}{2 \cdot (x_i-2)!} = j\\ & \iff \sum_{i=1}^{n} \frac{x_i!}{(x_i-2)!} = 2j\\ & \iff \sum_{i=1}^n x_i(x_i-1) = 2j\\ & \iff \sum_{i=1}^n x_i^2 = 2j + \sum_{i=1}^n x_i\\ & \iff \sum_{i=1}^n x_i^2 = 2j + m \end{align*} $$ where the last equality follows from the fact the sum of the balls in each bins is the number of balls tossed.

Therefore, this problem is equivalent to solving the following problem: how many representations as a sum of $n$ squares does a number have? This problem seems far from trivial to me, as I did not find easy explicit formulas to work with.

However this approach was for finding exactly $j$ collisions, and I am interested in finding at least $j$ collisions. As $m > n$, we will have some collisions. We can express $m = qn + r$ for some $q, r \in \mathbb{N}$. The least collisions we will have is when the balls are uniformly distributed amongst the bins, thus we will have at least $$C_{min} = \sum_{i=1}^r \binom{q+1}{2} + \sum_{i=1}^{n-r} \binom{q}{2}$$ collisions, which forces $j > C_{min}$. The problem thus becomes, in how many ways can we distribute $m$ balls into $n$ bins such that $\sum_{i=1}^n \binom{x_i}{2} \geq j$.

Here I am stuck, I can't find an easy way with combinatorics to solve this. I am fine with this problem being solved asymptotically, so if you have any ideas and/or references on the subject... :).

Thanks for reading and helping


#Edit I think both cases above are wrong, because I was undercounting some things. Probability with the function given in the case <span class=$m \leq n$" /> This image shows that the probability of having $j$ collisions do not increase nor decrease. I was expecting something like a gaussian, or at least something with a low probabiliy of having a few collisions, a high probability of having an average number of collisions, and a low probability of having a lot of collisions.

I thought about another way of counting, and I think the results I obtain are better. For that, I need the following Lemma, taken from this paper, page 16.

Lemma 1: The probability that exactly $k$ bins are not empty after throwing $m$ balls is $\frac{\binom{n}{k}k!S(m,k)}{n^m}$, where $S(m,k)$ is a Stirling number of the second kind.

With this Lemma, I obtain the following result.

Lemma 2: Let $m$ be the number of balls thrown uniformly and independently ar random into $n$ bins. The probability of having at least $j>1$ collisions and at least $j$ empty bins is $$ \mathbb{P}[\text{at least } j \text{ collisions}] = \frac{1}{n^{m}} \sum_{i=1}^{n-j} \binom{n}{i} i! S(m,i). $$
Proof: As explained previously, the probability of having at least $j$ collisions and at least $j$ empty bins (or at most $n-j$ non-empty bins) is equal to the probability of tossing $m$ balls in at most $n-j$ bins. Let $A$ be the event "at most $n-j$ bins are not empty" and let $A_i$ be the event "exactly $i$ bins are not empty". Then $A = \cup_{i=1}^{n-j} A_j$ and $A_i \cap A_k = \emptyset$ if $i \neq k$, hence $\mathbb{P}[A] = \sum_{i=1}^{n-j}\mathbb{P}[A_j]$ and Lemma 1 concludes the proof. $\square$

We can see on the next plot that this results already seems closer to reality (this was plotted for $n = 500$, $m = 200$ and $j$ from $1$ to $1000$). enter image description here

Moreover, with this approach, we don't need to separate the cases $m \leq n$ and $m > n$ (at least I think, you might need to correct me on this).

The downside is that now we have to work with a sum... and Stirling numbers of the second kind...

Do you guys have any idea to remove the sum and obtain an upper bound of this?