Expected value of randomly filling buckets until conflict

164 Views Asked by At

The problem is directly related to the famous birthday paradox, but with a twist/complication.

problem

We have $n$ buckets and we randomly put balls into them. What’s the expected value of balls to be put if we stop after the first collision—attempt to put a ball into an already filled bucket?

What if we have $s$ tries for every ball?

context

What makes things easier, is that I need only an approximate solution and I can use computer to do the heavyweight calculus. What makes things harder is that I’d like to solve that for values $n$ somewhere around $[2^{32};2^{64}]$ so calculating the sum the most straightforward way isn’t feasible.

XY problem, alternatives

The actual problem behind is evaluating efficiency of a [kind of] hash-table implementation. So if there’s a way to estimate efficiency, by calculating another function of distribution, I’m OK with that. Actually, in the birthday problem we find a number of items $k$ such that collision probability is $\approx 0.5$. I was able to “experimentally” find that for $s = 1$, $k \propto n^{1/2}$ and for $s = 2$ $k \propto n^{2/3}$ which leads to extrapolation for $s = 4$: $k \propto n^{4/5}$, but that’s too speculative and I’m not sure finding that $k$ is meaningful for my case.

where I am currently with solving

In the case $s = 1$ The value would be:

$$\frac{1}{n}\cdot1 + \frac{n-1}{n}\frac{2}{n}\cdot2 + \frac{n-1}{n}\frac{n-2}{n}\frac{3}{n}\cdot3+\cdots$$ or $$\frac{1}{n}\sum_{k=1}^{n} \frac{n!}{(n-k)!}k^{2}\frac{1}{n^{k}}$$ which is not very hard to solve approximately. E.g. by turning a sum into an integral with factorial expressed by Stirling’s approximation and then it might be solved symbolically (haven’t tried TBH).

I actually wanted to solve the problem for $s = 4$, but generic solution would be better.

$$\frac{1}{n^s}\cdot1 + \frac{n^s-1}{n^s}\frac{2^s}{n^s}\cdot2 + \frac{n^s-1}{n^s}\frac{n^s-2^s}{n^s}\frac{3^s}{n^s}\cdot3+\cdots$$ or $$\frac{1}{n^s}\sum_{k=1}^{n} k^{s+1}\prod_{j=0}^{k-1}\frac{n^s - j^s}{n^s}$$

For $s = 2$ we are getting $a^2 - b^2 = (a-b)(a+b)$, which easily collapses into pretty simple factorials combinations like for $s = 1$, but for $s = 4$ I found no way to simplify the expression.

2

There are 2 best solutions below

0
On

For the first collision, we have the generalized birthday problem. To get a $\frac 12$ chance of a collision (not the same as the expected number, but not far off) you need $\sqrt {2 \ln 2 n}$ balls.

For $s$ tries per ball, I will assume for each ball you pick a random bucket. If the bucket is occupied, you again pick a random bucket, which could be the same as the one you just tried. If you get $s$ occupied buckets for a single ball you stop.

The first ball succeeds. The second fails with probability $\frac 1{n^s}$ because you have to hit the one occupied bucket $s$ times. Assuming the second succeeded, the third fails with probability $\frac {2^s}{n^s}$ because there are two occupied buckets to hit. The chance you succeed out to $k$ balls is $$\prod_{i=1}^k\left(1-\frac {(i-1)^s}{n^s}\right)$$

I wrote a quick program. With $s$ down the first column and $n$ across the top, I find the number of balls to have a $\frac 12$ chance of collision is as follows $$\begin {array} {r|r|r|r} \ &10^6&10^7&10^8\\ \hline 1&1,178&3,724&11,775 \\2&12,765&59,245&274,990\\3&40,807&229,468&1,290,392\\ 4&80,903&510,458&3,220,766\\5&126,814&863,966&5,886,125\end {array}$$ The first line checks against the Wikipedia formula.

0
On

I also measured with a program, but this time it was the mean value:

function mean(s, n) {
  var sum = 0;
  var prod_log = 0;
  for (var k = 1; k <= n; k++) {
    sum += k ** (s+1) * Math.exp(prod_log);
    prod_log += Math.log1p(-1 * (k/n) ** s);
  }
  return sum / n**s;
}

warning: measuring $n=2^{30}$ takes a whole minute

Results fit empirical formula $$k \propto n^{\frac{s}{s+1}+\varepsilon}$$

$$\begin {array} {r|r} \ s&2^{10}&2^{15}&2^{20}&2^{25}&2^{30}&k=f(n)\\ \hline 1&5.314&7.824&10.325&12.826&15.325&\log k\approx\frac{1}{2}\log n+0.325 \\2&7.028&10.365&13.698&17.032&20.365&\log k\approx\frac{2}{3}\log n+0.365 \\ 4&8.340&12.341&16.341&20.341&24.341&\log k\approx\frac{4}{5}\log n+0.341\\9&9.260&13.760&18.260&22.760&27.260&\log k\approx\frac{9}{10}\log n+0.260\end {array}$$