Finding optimal cut-off point for a bi-uniform distribution in a sampling without replacement

64 Views Asked by At

As part of a project investigating the properties of a caching architecture, I have been working on sampling without replacement. Fix $N \in \mathbb{N}$ and $k << N$ and a distribution $P$ on $N$ (typically Zipf's law with fairly large $s$) such that $p_i$ is a decreasing function in terms of $i$. We define a random process $X_k$ that "caches" $k$ unique elements and then samples a final element from $P$. $X_k$ returns 1 if the final element was already one of the $k$ cached elements and 0 otherwise. Computing $\mathbb{P}(X_k = 1)$ exactly is pretty hard, so I am interested in finding another way of approximating its value through a fairly tight upper bound. My idea is to leverage a bi-uniform distribution family, that is the sum of two constant distributions (see the Figure for an example), to upper-bound $\mathbb{P}(X_k = 1)$ for any given distribution $P$.

Example of bi-uniform distribution

For a given $k \in \mathbb{N}$, I am interested in finding the closest bi-uniform to my original distribution. Each bi-uniform is defined by two parameters: (1) a cut-off point $c_p < N$ and (2) a height for the first bi-uniform $h \in [0,1]$. I am looking for a strategy to discover the tightest bi-uniform distribution for every possible distribution $P$ and $k$. By tightest, I mean that I am looking for the parameters $(c^*,h^*)$ such that for a given distribution $P$, the quantity $\mathbb{P}_P(X_k) - \mathbb{P}_{bi-uniform}(X_k)$ is minimized and is larger than 0. For example if $k = 1$, then the tightest bi-uniform's parameters are $c_p = 1$ and $h = p_1$. In particular, I am interested in proving the following lemma:

Lemma: Assume that for a given distribution $P$ and a fixed $k$, we have identified a tightest bi-uniform with parameters ($c^*$,$h^*$). Then the optimal bi-uniform's cut-off for the distribution $P$ and $k+1$ has to be larger or equal to $c^*$.

I have tried to prove it by induction but I am struggling to finalize the argument. Any help would be extremely appreciated!