Probability of distribution using a discrete random function

72 Views Asked by At

Problem:

Given P persons (say 1000) distribued among N rooms (say 50) using a discrete random distribution function.

What would be the probability of a room having at least K persons (say 30 or more persons in the same room) ? document

Context:

I'm trying to store large data sets (1010 entries) in smaller capacity databases (maximum 108 documents).

The idea is to group multiple entries is the same DB document using a hash function in such a way that every DB documents would store 100 data entries on average.

BUT, I'm affrayed of having DB documents with too many entries say 1000+ in the same DB (some kind of the birthdays paradox with multiple collisions)

2

There are 2 best solutions below

0
On BEST ANSWER

Let's try an easier question: what is the expected number of rooms with $K$ or more occupants?

Considering a single room, we can use a binomial distribution:

  • the probability of exactly $K$ occupants of that room is ${P \choose K}\frac{(N-1)^{P-K}}{N^P}$
  • so the probability of at least $K$ occupants of that room is $\sum\limits_{j=K}^N{P \choose j}\frac{(N-1)^{P-j}}{N^P} = 1 - \sum\limits_{j=0}^{K-1}{P \choose j}\frac{(N-1)^{P-j}}{N^P}$

So in your first example with $P=1000, N=50, K=30$ you get the probability of at least $K$ occupants of that room being about $0.0207$, while with $P=10^{10}, N=10^8, K=10^3$ you get the probability of at least $K$ occupants of that room being about $10^{-611}$

There are $N$ rooms, so the expected number of rooms with at least $K$ occupants is $N$ times the previously calculated probabilities, which in your first example is about $1.035$ and in your second is about $10^{-603}$. You can find the power of $10$ in the second case using R with

K <- 10^3; P <- 10^10; N <- 10^8
(pbinom(K-1, P, 1/N, lower.tail=FALSE, log.p=TRUE) + log(N)) / log(10) 
# -602.9884

You can be sure that the probability of any room having at least $K$ occupants is no more than this, which is not particularly helpful for the first example but is for the second since the expectation is much smaller than $1$ and so the probability is still about $10^{-603}$.

If the expected number $E$ of rooms having at least $K$ occupants is much smaller than $N$ and $N$ is large, then perhaps you could try as a Poisson approximation for the probability of any room having at least $K$ occupants the value $1-e^{-E}$, which for the first example is about $0.645$, perhaps a little low since a simulation suggested a figure closer to about $0.676$.

0
On

Let us assume you have a function $F : [[1, P]] \times [[1, N]] \to [0, 1]$ such that: $F(i, j) = \Bbb P(A_i = j)$ where $(A_i = j)$ is the event: the $i$-th person got assigned to the $j$-th room. ($F$ is computable when you know your random discrete distribution).

Now, you want to count how many persons are there in a given room, we denote $X_j$ the amount of person in the $j$-th room, so that:

\begin{equation*} X_j = \sum_{i=1}^P 1_{A_i = j} \end{equation*}

For example, by Markov's inequality:

\begin{equation*} \Bbb P(X_j \geq K) \leq \dfrac{\Bbb E(X_j)}{K} = \dfrac{\sum_{i=1}^P F(i, j)}{K} \end{equation*}

You can improve this result a lot with more assumptions on your random discrete distribution, for example, if you know that $(1_{A_i = j}))_i$ are independent for a certain $j$, you can write, by Chernoff's inequality, for $p = \Bbb E(X_j)$ and $\varepsilon > 0$ :

\begin{equation*} \Bbb P\left(\dfrac{1}{P}X_j \geq b + \varepsilon\right) \leq \exp(-2\varepsilon^2 P) \end{equation*}

To illustrate this, let us take $F(i, j) = \dfrac{1}{N}$, the uniform distribution.

Now, we have: $\Bbb E(X_j) = \dfrac{P}{N}$ (say $20$ person per room in average).

Finally, $\Bbb P(X_j \geq K) \leq \dfrac{P}{NK}$, so the probability of having, say 30 persons in any room is bounded by $\dfrac{20}{30} = \dfrac{2}{3}$.

Better, by Chernoff's inequality, we want to evaluate:

\begin{align*} \Bbb P(X_j \geq K) & = \Bbb P\left(\dfrac{1}{P} X_j \geq \dfrac{K}{P}\right) \end{align*}

Thus, we take: $\varepsilon = \dfrac{K}{P} - b = \dfrac{K}{P} - \dfrac{1}{N}$.

So that: $\Bbb P(X_j \geq K) \leq \exp\left[-2\left(\dfrac{K}{P} - \dfrac{1}{N}\right)^2 P\right]$.

With the same example, for $K = 30$, Chernoff gives the bound: $0.819$ which is worse than Markov.

But, if you want to look at $K=50$, Markov gives the bound: $0.4$ and Chernoff gives the bound: $0.165$.

Here is a nice chart which shows the comparison between two inequalities, notice that one requires an assumption of independence and another does not:

Chernoff's inequality vs Markov's inequality for K from 20 to 100]