Expected value of $\max\limits_{1\leq i\leq m}X_i-\min\limits_{1\leq i\leq m} X_i$ for balls and bins problem


Recently I came across the following problem, which has occupied me for a few weeks. This is purely out of curiosity, and I have not been able to find any research paper on this particular question.

Problem: Given $N$ balls, each ball is placed in a box chosen uniformly at random among $m$ numbered boxes. Let $X_i$ denote the random variable counting the number of balls in box $i$, and define $Y=\max\limits_{1\leq i\leq m}X_i-\min\limits_{1\leq i\leq m} X_i$. What are the expected value and the variance of $Y$?

I am looking for any reference or research paper providing an answer (or an approximation) to the problem when $N$ is large and $m$ is constant.
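For concreteness, the quantities in question are easy to estimate by simulation. Below is a minimal Monte Carlo sketch (function names `sample_Y` and `estimate_Y` are my own, not from any reference) that throws $N$ balls into $m$ boxes and estimates $\mathbb{E}[Y]$ and $\operatorname{Var}(Y)$:

```python
import random

def sample_Y(N, m, rng):
    # Throw N balls into m boxes uniformly at random and return
    # Y = max_i X_i - min_i X_i for that configuration.
    counts = [0] * m
    for _ in range(N):
        counts[rng.randrange(m)] += 1
    return max(counts) - min(counts)

def estimate_Y(N, m, trials=500, seed=0):
    # Monte Carlo estimates of E[Y] and Var(Y) over independent runs.
    rng = random.Random(seed)
    ys = [sample_Y(N, m, rng) for _ in range(trials)]
    mean = sum(ys) / trials
    var = sum((y - mean) ** 2 for y in ys) / (trials - 1)
    return mean, var
```

For example, `estimate_Y(2000, 5)` gives an empirical mean on the order of $\sqrt{N}$, consistent with the $\sqrt{N}$ scaling discussed in the answer below.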


BEST ANSWER

Not a satisfactory answer, but here is a small observation: if $M_m := \mathbb{E}\left[\max_{1\leq i\leq m} W_i \right]$ denotes the expectation of the maximum of $m$ i.i.d. standard normal variables $W_1, \cdots, W_m$, then we have

$$\lim_{N\to\infty} \frac{1}{\sqrt{N}}\mathbb{E}\left[ \max_{1\leq i \leq m} X_i - \min_{1\leq i \leq m} X_i \right] = \frac{2M_m}{\sqrt{m}} . \tag{*}$$
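The constant on the right-hand side is easy to estimate numerically. A small sketch (my own helper names, not from any reference) estimates $M_m$ by Monte Carlo and forms the predicted limit $2M_m/\sqrt{m}$; for $m=2$ there is the closed form $M_2 = 1/\sqrt{\pi} \approx 0.5642$, which serves as a sanity check:

```python
import math
import random

def max_of_gaussians_mean(m, trials=200_000, seed=1):
    # Monte Carlo estimate of M_m = E[max of m i.i.d. N(0,1) variables].
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(m))
    return total / trials

def predicted_limit(m):
    # Right-hand side of (*): lim_N E[Y] / sqrt(N) = 2 * M_m / sqrt(m).
    return 2.0 * max_of_gaussians_mean(m) / math.sqrt(m)
```

Comparing `predicted_limit(m) * sqrt(N)` against a direct simulation of $\mathbb{E}[Y]$ for large $N$ gives a quick empirical check of $(*)$.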

The idea is as follows. Roughly speaking, the main source of correlation among the $X_i$'s is the constraint that the total sum $X_1 + \cdots + X_m = N$ is fixed. That is, we can loosely think of each $X_i$ as the following re-centered version

$$ X_i \quad {``}\approx{\text{''}} \quad \tilde{X}_i + \frac{N}{m} - \frac{\sum_{k=1}^m\tilde{X}_k}{m} $$

of some i.i.d. random variables $\tilde{X}_1, \cdots, \tilde{X}_m$. Since this re-centering does not affect the difference, we expect that $Y \approx \tilde{Y} := \max_i \tilde{X}_i - \min_i \tilde{X}_i$. Taking advantage of the independence, it is easier to investigate $\tilde{Y}$ instead.


Here is a more detailed argument. If we consider

$$ \tilde{Z}_i = \frac{X_i - (N/m)}{\sqrt{N/m}} $$

then the random vector $\tilde{Z} = (\tilde{Z}_1, \cdots, \tilde{Z}_m)$ converges in distribution to the multivariate normal distribution $\mathcal{N}(0, C)$ with the covariance matrix $C$ satisfying

$$ C_{ij} = \delta_{ij} - \frac{1}{m}.$$
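This covariance structure in fact holds exactly for every $N$, not just in the limit, since $\operatorname{Var}(X_i) = N\frac{1}{m}(1-\frac{1}{m})$ and $\operatorname{Cov}(X_i, X_j) = -\frac{N}{m^2}$ for $i \neq j$. A quick empirical check (helper name `empirical_cov` is my own):

```python
import random

def empirical_cov(N, m, trials=5_000, seed=3):
    # Sample Z_tilde_i = (X_i - N/m) / sqrt(N/m) from the balls-and-bins
    # model and return the empirical covariance matrix, which should be
    # close to C_ij = delta_ij - 1/m (this holds exactly for every N).
    rng = random.Random(seed)
    scale = (N / m) ** 0.5
    sums = [[0.0] * m for _ in range(m)]
    for _ in range(trials):
        counts = [0] * m
        for _ in range(N):
            counts[rng.randrange(m)] += 1
        z = [(c - N / m) / scale for c in counts]
        for i in range(m):
            for j in range(m):
                sums[i][j] += z[i] * z[j]
    # E[Z_tilde] = 0 exactly, so no mean subtraction is needed.
    return [[s / trials for s in row] for row in sums]
```

For $m = 4$ the diagonal entries should be near $3/4$ and the off-diagonal entries near $-1/4$.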

So we consider a multivariate normal vector $Z = (Z_1, \cdots, Z_m) \sim \mathcal{N}(0, C)$. We also introduce a normal variable $\bar{Z} \sim \mathcal{N}(0, \frac{1}{m})$ which is independent of $Z$. Finally, consider $W_i = Z_i + \bar{Z}$. Then we easily compute that

$$ \operatorname{Cov}(W_i, W_j) = C_{ij} + \frac{1}{m} = \delta_{ij}, $$

from which we find that the $W_i$ are i.i.d. standard normal variables. Since $Z_i - Z_j = W_i - W_j$, we may study the maximum and minimum of $W$ instead. So

\begin{align*} \mathbb{E}\left[\max_i Z_i - \min_i Z_i\right] = \mathbb{E}\left[\max_i W_i - \min_i W_i\right] = 2\mathbb{E}\left[\max_i W_i\right] = 2M_m, \end{align*}

where the last equality uses $\mathbb{E}[\min_i W_i] = -\mathbb{E}[\max_i W_i]$, which follows from the symmetry $W \overset{d}{=} -W$.

Using this we can prove that $\text{(*)}$ holds.
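Note also that the rows of $C$ sum to zero, so $\sum_i Z_i = 0$ almost surely and $Z$ can be sampled simply by centering i.i.d. standard normals: $Z_i = W_i - \bar{W}$. Since centering does not change the range, this gives a direct way to check $\mathbb{E}[\max_i Z_i - \min_i Z_i] = 2M_m$ numerically (a sketch with my own helper name; for $m=2$ the range is $|W_1 - W_2|$ with mean $2/\sqrt{\pi} \approx 1.128$):

```python
import random

def range_of_Z_mean(m, trials=100_000, seed=2):
    # Sample Z by centering i.i.d. standard normals (valid because
    # sum_i Z_i = 0 a.s.); the range max Z - min Z equals max W - min W,
    # so its mean should be close to 2 * M_m.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        w = [rng.gauss(0.0, 1.0) for _ in range(m)]
        total += max(w) - min(w)
    return total / trials
```

This matches the identity $\mathbb{E}[\max_i Z_i - \min_i Z_i] = 2M_m$ derived above.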


Have you seen: