I am thinking on the follow probabilistic problem setting:
Say we have $n$ boxes, each of which has an affiliated set of $k$ balls. We aim to distribute these balls across the boxes in the most fair manner (each receiving an equal number of balls). The contributed balls are broken down as follows:
- The first $n-1$ boxes each contribute $k$ balls which can only be given to themselves (call them ''specific'' to these boxes). As such, these $(n-1)k$ balls have no ambiguity as to where they can be distributed.
- For the final box, we have $k$ balls which each have a probability equal to $1/2$ of being "specific" to that box or non-specific (can be put in any box).
If we are given all the balls at once (an ''offline'' setting), we can trivially distribute so that each box has $k$ balls.
My question is this, somewhat a coupon collector problem:
If the balls are arriving one-by-one (randomly), how many do we have to see before we can identify which box is the unique "final" box (or with high probability make this conclusion)? I have analyzed this for the setting where the final box instead contributes only non-specific items (or the probability of being specific is $p=0$), which is almost exactly a coupon collector problem (once we have seen one of each specific item, we know the box that is not specific). However, for the more complex case that the special box has some specific items, it seems that we would need to sample more to identify this box?
EDIT: It is also important to note that when $k=\infty$ we are much closer to a coupon collector instance. Additionally, with $k=1$ the problem is trivial so we are assuming $k$ is large in some sense.
This is obviously a complicated problem, but a rough estimate is available for $k=\infty$. In that case, we expect to need $n\log n+(m-1)n\log\log n+O(n)$ balls to have at least $m$ specific balls in each normal box (see Wikipedia). The expected number of specific balls in the final box at this point is
$$\frac1{2n}\left(n\log n+(m-1)n\log\log n+O(n)\right)=\frac12\left(\log n+(m-1)\log\log n\right)+O(1)\;.$$
For this to be less than $m$, we need $$\frac12\left(\log n+(m-1)\log\log n\right)\lt m$$
$$\rightarrow m\gt\frac{\log n-\log\log n}{2-\log\log n}\;.$$
This diverges for $\log\log n=2$, that is, for $n=\mathrm e^{\mathrm e^2}\approx1618$. Beyond that point, the number of specific balls in the final box grows faster than the minimal number of specific balls in the normal boxes, so you can never hope to reliably identify the final box.
If $k$ is finite, you need to see an appreciable proportion of the $kn$ balls until it becomes sufficiently likely for the underpopulated boxes to be filled.