Asymptotic convergence of a count-distinct estimator


Consider the following variant of the count-distinct problem.

Problem Setting. Let $B = \{ b_1, \dots, b_n \}$ be a finite set of $n$ balls, let $C$ be a set of colours, let $F : B \to C$ be a function that assigns a colour to each ball (not necessarily injectively), and let $D = \{ F(b) \mid b \in B \}$ be the set of distinct colours occurring in $B$. The problem is to estimate $|D|$. Given balls $b_i$ and $b_j$, determining $F(b_i)$ and checking $i = j$ (i.e., checking ball identity) is computationally cheap; checking $i < j$ is much more computationally involved (but feasible); and computing $i - j$ is infeasible.

Disclaimer. The above is an abstraction of a considerably more complex problem from database research. Hence, most solutions to the count-distinct problem I am aware of are not applicable, and I really need the answer to the specific question I pose below.

I next present two sampling-based algorithms that estimate $|D|$, after which I present a question about asymptotic convergence of the second one.

Algorithm 1. Assume that we are "magically" given a function $R : D \to B$ that assigns to each colour $c \in D$ occurring in $B$ an arbitrarily chosen representative ball $R(c) \in B$. Now consider the following.

Choose a ball b in B uniformly at random.
If R(F(b)) = b, then return n; otherwise return 0.

It is easy to see that this provides an unbiased estimator of $|D|$: exactly $|D|$ of the $n$ balls are representatives, so the expected value of one run is $n \cdot |D|/n = |D|$. The estimator's variance is large; however, if we average a large number of independent runs, the result almost surely converges to $|D|$ by the strong law of large numbers, and the variance of the average decreases as $1/k$.
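To make the setting concrete, here is a minimal Python sketch of Algorithm 1 on a hypothetical instance; the balls `0..n-1`, the colour names, and the particular way `R` is fixed in advance are all assumptions for illustration, not part of the original problem:

```python
import random

random.seed(0)

# Hypothetical instance: balls are 0..n-1, F assigns each ball one of four colours.
n = 1000
colours = ["red", "green", "blue", "yellow"]
F = {b: random.choice(colours) for b in range(n)}
D = set(F.values())                      # the distinct colours occurring in B

# R is fixed in advance: an arbitrary representative ball per colour
# (here, simply the first ball of each colour).
R = {}
for b in range(n):
    R.setdefault(F[b], b)

def algorithm1():
    """One run of Algorithm 1: return n if the drawn ball is its colour's
    representative, and 0 otherwise."""
    b = random.randrange(n)              # choose a ball uniformly at random
    return n if R[F[b]] == b else 0

# Exactly |D| of the n balls are representatives, so E[one run] = n * |D|/n = |D|.
runs = 200_000
avg = sum(algorithm1() for _ in range(runs)) / runs
print(avg, len(D))                       # the average is close to |D|
```

Each run is $n$ times a Bernoulli($|D|/n$) variable, so individual runs are almost always $0$; only the averaging makes the estimate useful.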

The problem with Algorithm 1 is that $R$ must be chosen in advance, which defeats the point of having an estimator in the first place. This leads me to the following, perhaps more realistic algorithm.

Algorithm 2. Let $R$ be a global mapping of colours in $D$ to representative balls. This mapping is initially empty, and it is updated across successive runs of the algorithm. Now consider the following.

Choose a ball b in B uniformly at random.
If R(F(b)) is undefined, then set R(F(b)) := b.
If R(F(b)) = b, then return n; otherwise return 0.

Let $X_1, X_2, \dots$ be random variables that model the outcomes of repeated invocations of Algorithm 2. Because $R$ is global, these random variables are neither independent nor identically distributed. Moreover, no $X_i$ provides an unbiased estimator of $|D|$; for example, the only possible realisation of $X_1$ is $n$. Nevertheless, we can run Algorithm 2 repeatedly and take the average.

Question. Let $Y_k = \frac{1}{k} \sum_{i=1}^k X_i$. I would ideally like to prove that $Y_k$ converges almost surely to $|D|$. That said, I would be happy with a proof of any form of convergence (in probability or in mean) if that is easier to show.

Observations. Although the variables $X_i$ are not i.i.d., Algorithm 2 provides a "dubious" estimate only when it finds $R(F(b))$ undefined; otherwise, Algorithm 2 behaves exactly like Algorithm 1, so the result is "not dubious". Moreover, set $D$ is finite, so the number of "dubious" estimates in any sequence of runs of Algorithm 2 is bounded by $|D|$. Thus, I expect the contribution of these "dubious" estimates to the average to tend towards zero as the number of runs increases. I also have decent empirical evidence that this algorithm performs well in practice and has the desired property. However, I am struggling to provide a sound formal proof.
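For what it is worth, the empirical behaviour is easy to reproduce. The following sketch runs Algorithm 2 with a shared, initially empty `R`; the concrete instance (ball count, colour names) is again a hypothetical assumption:

```python
import random

rng = random.Random(2)

# Hypothetical instance: balls 0..n-1, four possible colours.
n = 1000
colours = ["red", "green", "blue", "yellow"]
F = {b: rng.choice(colours) for b in range(n)}
D = set(F.values())

R = {}                                   # global mapping, shared across runs

def algorithm2():
    """One run of Algorithm 2: defines R(F(b)) on first contact with a colour."""
    b = rng.randrange(n)                 # choose a ball uniformly at random
    R.setdefault(F[b], b)                # if R(F(b)) is undefined, set it to b
    return n if R[F[b]] == b else 0

# The first run is always "dubious": R is empty, so X_1 = n deterministically.
# At most |D| runs in total can be dubious, so their effect on the average fades.
runs = 200_000
Y_k = sum(algorithm2() for _ in range(runs)) / runs
print(Y_k, len(D))                       # Y_k is close to |D|
```

The at most $|D|$ "dubious" runs each contribute $n$ to the sum, so their total effect on $Y_k$ is at most $|D| \cdot n / k$, which vanishes as $k$ grows.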

What particularly confuses me is that Algorithm 2 can have infinite runs in which $R$ never becomes fully populated; for example, one can keep drawing $b_1$ forever. Such runs have probability zero, but they are possible. Hence, arguing that "after some number of steps, Algorithm 2 becomes Algorithm 1, so we can disregard all 'dubious' estimates produced thus far" does not seem sufficiently rigorous to me.

Any help with this would be greatly appreciated, and of course I would give proper credit to anyone who can help me with this. (We can discuss this privately.)

Accepted answer

As you write, for any given function $R$, the corresponding unbiased estimator of Algorithm $1$ almost surely converges to $|D|$. There are only finitely many such functions. Since the probability that the estimator fails to converge is $0$ for each function, the union of these finitely many failure events also has probability $0$. Thus, almost surely, the estimators for all possible functions $R$ converge to $|D|$ simultaneously.

It’s not just that individual runs that don’t eventually fully populate $R$ have probability $0$ (in fact, every individual run has probability $0$); the entire event that $R$ is never fully populated has probability $0$. (Its probability is at most $n$ times the probability that a given ball is never drawn, and the probability that a given ball is not drawn in the first $k$ steps is $(1-1/n)^k$, which tends to $0$.) Thus, since you only want to show that $Y_k$ almost surely converges to $|D|$, we can disregard such runs.
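The claim that $R$ is almost surely fully populated after finitely many steps is essentially coupon collecting, and it can be sanity-checked numerically; the instance below is a hypothetical assumption for illustration:

```python
import random

def time_to_populate(n, F, seed):
    """Number of draws until every colour of F has been seen at least once,
    i.e. until the representative map R is fully defined (coupon collector)."""
    rng = random.Random(seed)
    D = set(F.values())
    seen, steps = set(), 0
    while seen != D:
        seen.add(F[rng.randrange(n)])   # drawing a ball reveals its colour
        steps += 1
    return steps

# Hypothetical instance: 1000 balls, four colours.
rng = random.Random(0)
n = 1000
F = {b: rng.choice(["red", "green", "blue", "yellow"]) for b in range(n)}

times = [time_to_populate(n, F, seed) for seed in range(100)]
print(min(times), max(times))   # every simulated run terminates after finitely many draws
```

Of course, a simulation cannot substitute for the measure-theoretic argument above; it only illustrates that the populating time is finite with overwhelming probability.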

So with probability $1$, for the instantiated run there is an index $m$ at which $R$ is fully populated. Then for $k\gt m$ we have

\begin{eqnarray*} Y_k&=&\frac1k\sum_{i=1}^kX_i \\ &=&\frac1k\left(\sum_{i=1}^mX_i+\sum_{i=m+1}^kX_i\right) \\ &=&\frac1k\sum_{i=1}^mX_i+\frac{k-m}k\frac1{k-m}\sum_{i=m+1}^kX_i\;. \end{eqnarray*}

As $\frac1{k-m}\sum_{i=m+1}^kX_i$ is a shifted form of the estimator of Algorithm $1$ for some function $R$, it almost surely converges to $|D|$. The first term is a constant times $\frac1k$, so it converges to $0$, and $\frac{k-m}k$ converges to $1$. Thus, $Y_k$ itself almost surely converges to $|D|$.

Second answer

A big thanks to @joriki for his great answer. I am posting this to fill in some details that initially confused me and to make the proof as transparent as possible.

Setting. Let $\Omega$ be the sample space describing infinite sequences of draws from $B$; thus, each $\omega \in \Omega$ is an infinite sequence of the form $\beta_1, \beta_2, \dots$ where $\beta_i \in B$ is the ball drawn in the $i$th step. Let $X_1, X_2, \dots$ be the sequence of random variables where $X_i(\omega)$ is the result of the $i$th invocation of Algorithm 2 on a given $\omega \in \Omega$. Also, let $Y_1, Y_2, \dots$ be the sequence of random variables where $Y_k = \frac{1}{k} \sum_{i=1}^k X_i$. Finally, let $P$ be the probability measure on $\Omega$.

Objective. To show that sequence $Y_k$ converges almost surely to $|D|$, we need to prove $P(\Omega_1) = 1$, where $$\Omega_1 = \{ \omega \in \Omega \mid \lim_{k \to \infty} Y_k(\omega) = |D| \}.$$

Proof. Let $$\Omega_2 = \{ \omega \in \Omega \mid \text{there exists a step } m \text{ of } \omega \text{ after which } R \text{ is fully defined in Algorithm 2} \}.$$ As per @joriki's argument, $P(\Omega_2) = 1$.

Now, for each $\omega \in \Omega_2$ and the corresponding $m$, let $Z^\omega_{m+1}, Z^\omega_{m+2}, \dots$ be random variables describing the outcomes of Algorithm 1 parameterised by the resulting representative function on steps $m+1, m+2, \dots$ of $\omega$. Furthermore, let $$\Omega_3 = \{ \omega \in \Omega_2 \mid \lim_{i \to \infty} Z^\omega_{m+i}(\omega) = |D| \}.$$ Now consider an arbitrary $\omega \in \Omega_3$ and the corresponding $m$. Then, $$\lim_{i \to \infty} Z^\omega_{m+i}(\omega) = |D| \;\;\; \text{ implies } \;\;\; \lim_{k \to \infty} Y_k(\omega) = |D|$$ as shown by @joriki, and so $\omega \in \Omega_1$. Thus, we have $\Omega_3 \subseteq \Omega_1$.

We next show $P(\Omega_3) = 1$. Let $R^1, \dots, R^p$ be all possible instantiations of the representative function; note that $p$ is finite. Furthermore, for $1 \leq i \leq p$, let $$\Omega_2^i = \{ \omega \in \Omega_2 \mid \omega \text{ instantiates the representative function to } R^i \}.$$ Clearly, $\Omega_2$ is the disjoint union $\bigcup_{i=1}^p \Omega_2^i$, which implies $1 = P(\Omega_2) = \sum_{i=1}^p P(\Omega_2^i)$. Moreover, $\Omega_3 = \bigcup_{i=1}^p (\Omega_3 \cap \Omega_2^i)$. Now, for each $i$ with $1 \leq i \leq p$, Algorithm 1 instantiated with $R^i$ is an unbiased estimator of $|D|$, so the strong law of large numbers ensures $P(\Omega_3 \mid \Omega_2^i) = 1$. But then, $P(\Omega_3) = \sum_{i=1}^p P(\Omega_2^i) \cdot P(\Omega_3 \mid \Omega_2^i) = 1$.

Finally, $\Omega_3 \subseteq \Omega_1$ and $P(\Omega_3) = 1$ imply $1 = P(\Omega_3) \leq P(\Omega_1) \leq 1$, which proves our objective.