Expected number of elements before 1 out of 2 hash tables are full?

147 Views Asked by At

From a previous asked question it became clear to me that the expected number of elements to fill up one hash table (assuming ideal randomness) is just the Harmonic number: $mH_m$ where $m$ is the size of the table (collisions could just be solved with chaining or whatever. Don't care.)

However, a more tricky question is:

What is the expected number of elements we need before one of the hash tables are full? I.e. I want to know how fast it takes before one of the tables are full. We can assume that choosing either table $a$ or $b$ has probability $P[a|b]= \frac{1}{2}$ (see figure below).

choosing either one

2

There are 2 best solutions below

11
On BEST ANSWER

There might be an elegant explanation of the result, but here is the bruteforce calculation.


We fix $m$ and define $E(a, b)$ as the expected number of elements to fill one table, if there are still $a$ and $b$ empty entries in the two tables.

Thus $E(0, *) = E(*, 0) = 0$ and we are looking for $E(m, m)$.

For $a, b > 0$, by considering the possibilities of the next element, we have: $$E(a, b) = 1 + \frac a {2m}E(a - 1, b) + \frac b {2m} E(a, b - 1) + (1 - \frac{a + b}{2m})E(a, b)$$ which simplifies to: $$(a + b) E(a, b) = 2m + aE(a - 1, b) + bE(a, b - 1).$$

If we define $F(a, b) = \frac 1 {2m} E(a, b)$, then we have the recurrence relation $$(a + b) F(a, b) = 1 + aF(a - 1, b) + bF(a, b - 1).$$


It is not totally obvious, but one can show by induction that $F(a, b) = H_a + H_b - H_{a + b}$ for all $a, b \geq 0$, where $H_n = \frac 1 1 + \dots + \frac 1 n$ denotes the harmonic number.

Proof: We simply need to verify that $$(a + b)(H_a + H_b - H_{a + b}) = 1 + a(H_{a - 1} + H_b - H_{a + b - 1}) + b(H_a + H_{b - 1} - H_{a + b - 1})$$ which is a routine calculation.


Therefore we have $F(m, m) = 2H_m - H_{2m}$ and hence $$E(m, m) = 2m F(m, m) = 4mH_m - 2mH_{2m}.$$

2
On

At the prompting of @user, I wanted to give an intuitive explanation of the formula found the in other answer.

Let $T_1$ be the time it takes to fill the first hash table, and $T_2$ be the time it takes to fill the second. You want $E[\min(T_1,T_2)]$. Since $\min(T_1,T_2)=T_1+T_2-\max(T_1,T_2)$, this is the same as $$ E[T_1]+E[T_2]-E[\max(T_1,T_2)] $$ Note that $\max(T_1,T_2)$ is the time it takes to fill out both hash tables, which is just like filling out a single larger hash table with size $2m$. Therefore, $E[\max(T_1,T_2)]=2m H_{2m}$, using what you already knew.

Symmetry implies $E[T_1]=E[T_2]$, but what is $E[T_1]$? It is not the same as the single hash table problem, since only about half of the hashes contribute to the first table. If every hash went to the first table, the expected time would be $mH_m$, so it stands to reason the expected time when you are going at half-speed should be $2mH_m$. This can be made rigorous using Wald's equation and an appropriate stochastic process, but hopefully this is at least believable.

Putting this altogether, you get $E[\min(T_1,T_2)]=2mH_m+2mH_m-2mH_{2m}$, as in the other answer.