Consider the following variant of the count-distinct problem.
Problem Setting. Let $B = \{ b_1, \dots, b_n \}$ be a finite set of $n$ balls, let $C$ be a set of colours, let $F : B \to C$ be a function that assigns a colour to each ball, and let $D = \{ F(b) \mid b \in B \}$ be the set of distinct colours occurring in $B$. The problem is to estimate $|D|$. Given balls $b_i$ and $b_j$, determining $F(b_i)$ and checking $i = j$ (i.e., checking ball identity) is computationally cheap; checking $i < j$ is much more computationally involved (but feasible); and computing $i - j$ is infeasible.
Disclaimer. The above is an abstraction of a considerably more complex problem from database research. Hence, most solutions to the count-distinct problem I am aware of are not applicable, and I really need the answer to the specific question I pose below.
I next present two sampling-based algorithms that estimate $|D|$, after which I present a question about asymptotic convergence of the second one.
Algorithm 1. Assume that we are "magically" given a function $R : D \to B$ that assigns to each colour $c \in D$ occurring in $B$ an arbitrarily chosen representative ball $R(c) \in B$. Now consider the following.
Choose a ball b in B uniformly at random.
If R(F(b)) = b, then return n; otherwise return 0.
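A minimal Python simulation of Algorithm 1 may help fix intuitions. The concrete instance (ten balls, four colours) and all names are my own illustration, not part of the problem statement:

```python
import random

# Toy instance (my own, for illustration): n = 10 balls coloured so that |D| = 4.
random.seed(0)
n = 10
balls = list(range(n))
colour_of = [b % 4 for b in balls]   # the function F : B -> C
D = set(colour_of)                   # here |D| = 4

# The "magically" given R: one arbitrary representative ball per colour,
# fixed in advance (here simply the first ball of each colour).
representative = {}
for b in balls:
    representative.setdefault(colour_of[b], b)

def algorithm1():
    """One run: n if the drawn ball is its colour's representative, else 0."""
    b = random.choice(balls)
    return n if representative[colour_of[b]] == b else 0

# Averaging many independent runs: the mean approaches |D|.
runs = 200_000
estimate = sum(algorithm1() for _ in range(runs)) / runs
print(estimate)
```

Each run returns $n$ with probability exactly $|D|/n$ (one representative per colour), so the expectation is $|D|$, matching the unbiasedness claim below.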
It is easy to see that this provides an unbiased estimator of $|D|$. The estimator's variance is large; however, if we take the average of a large number of runs, the result almost surely converges to $|D|$ by the strong law of large numbers, and the variance of the average shrinks as the number of runs grows.
The problem with Algorithm 1 is that $R$ must be chosen in advance, which defeats the point of having an estimator in the first place. This leads me to the following, perhaps more realistic algorithm.
Algorithm 2. Let $R$ be a global partial mapping of colours in $D$ to representative balls. This mapping is initially empty, and it can be updated in successive runs of the algorithm. Now consider the following.
Choose a ball b in B uniformly at random.
If R(F(b)) is undefined, then set R(F(b)) := b.
If R(F(b)) = b, then return n; otherwise return 0.
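Algorithm 2 can be sketched in Python in the same way; the toy instance and names are again my own. The only difference from the previous sketch is that `R` is built up lazily across runs instead of being fixed in advance:

```python
import random

random.seed(1)
n = 10
balls = list(range(n))
colour_of = [b % 4 for b in balls]   # the function F; |D| = 4 in this toy instance

R = {}   # the global mapping: initially empty, updated across runs

def algorithm2():
    b = random.choice(balls)
    c = colour_of[b]
    if c not in R:        # R(F(b)) undefined: b becomes the representative
        R[c] = b
    return n if R[c] == b else 0

# The running average Y_k over k invocations.
k = 200_000
Y_k = sum(algorithm2() for _ in range(k)) / k
print(Y_k)
```

Empirically the running average lands close to $|D| = 4$, which is exactly the behaviour the question asks to prove.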
Let $X_1, X_2, \dots$ be random variables that model the outcomes of repeated invocations of Algorithm 2. Because $R$ is global, these random variables are neither independent nor identically distributed. Moreover, no $X_i$ provides an unbiased estimator of $|D|$; for example, the only realisation of $X_1$ is $n$. Nevertheless, we can run Algorithm 2 repeatedly and take the average.
Question. Let $Y_k = \frac{1}{k} \sum_{i=1}^k X_i$. I would ideally like to prove that $Y_k$ converges almost surely to $|D|$. Actually, I would be happy with a proof of any weaker form of convergence (in probability or in mean) if that is easier to show.
Observations. Although the variables $X_i$ are not i.i.d., Algorithm 2 provides a "dubious" estimate only when it finds $R(F(b))$ undefined; otherwise, Algorithm 2 behaves like Algorithm 1, so the result is "not dubious". Moreover, the set $D$ is finite, so the number of "dubious" estimates that we can get in any finite sequence of runs of Algorithm 2 is bounded. Thus, I expect the contribution of these "dubious" estimates to the average to tend towards zero as the number of runs increases. I also have decent empirical evidence that this algorithm performs well in practice and has the desired property. However, I am struggling to provide a sound formal proof.
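The boundedness of the "dubious" estimates is easy to check in a small simulation (again a toy instance of my own): each colour triggers the "undefined" branch at most once, so over any sequence of runs the count of dubious runs never exceeds $|D|$.

```python
import random

random.seed(2)
n = 10
colour_of = [b % 4 for b in range(n)]   # toy F with |D| = 4 (my own instance)
R = {}
dubious = 0   # number of runs in which R(F(b)) was found undefined

for _ in range(100_000):
    b = random.randrange(n)
    c = colour_of[b]
    if c not in R:   # the "dubious" branch: fires at most once per colour
        R[c] = b
        dubious += 1
print(dubious)   # at most |D| = 4, however many runs we perform
```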
What particularly confuses me is that Algorithm 2 can have infinite runs in which $R$ never gets fully populated; for example, one can keep choosing $b_1$ forever. Such runs have zero probability, but they are possible. Hence, saying something like "after some number of steps, Algorithm 2 becomes Algorithm 1, and so we can disregard all 'dubious' estimates produced thus far" does not seem sufficiently rigorous to me.
Any help with this would be greatly appreciated, and of course I would give proper credit to anyone who can help me with this. (We can discuss this privately.)
As you write, for any given function $R$, the corresponding unbiased estimator of Algorithm $1$ almost surely converges to $|D|$. There are only finitely many such functions. Since the probability that the estimator does not converge is $0$ for each function, the union of this finite set of events also has probability $0$. Thus, almost surely the estimators for all possible functions $R$ converge to $|D|$.
It’s not just that individual runs that don’t eventually fully populate $R$ have probability $0$ (in fact, every individual run has probability $0$); the entire event that $R$ isn’t eventually fully populated has probability $0$. (Its probability is at most $n$ times the probability that a given ball is never drawn, and the probability that a given ball is not drawn in the first $k$ runs is the geometric sequence $(1-1/n)^k$, so in the limit it’s $0$.) Thus, since you only want to show that $Y_k$ almost surely converges to $|D|$, we can disregard such runs.
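Spelled out (a sketch; $A_j$ is my notation for the event that ball $b_j$ is never drawn, which is not used elsewhere in this answer), the parenthetical union bound is
$$\Pr\left[\bigcup_{j=1}^n A_j\right]\;\le\;\sum_{j=1}^n\Pr[A_j]\;=\;\sum_{j=1}^n\lim_{k\to\infty}\left(1-\frac1n\right)^k\;=\;0\;.$$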
So with probability $1$, for the instantiated run there is an index $m$ at which $R$ is fully populated. Then for $k\gt m$ we have
\begin{eqnarray*} Y_k&=&\frac1k\sum_{i=1}^kX_i \\ &=&\frac1k\left(\sum_{i=1}^mX_i+\sum_{i=m+1}^kX_i\right) \\ &=&\frac1k\sum_{i=1}^mX_i+\frac{k-m}k\frac1{k-m}\sum_{i=m+1}^kX_i\;. \end{eqnarray*}
As $\frac1{k-m}\sum_{i=m+1}^kX_i$ is a shifted form of the estimator of Algorithm $1$ for some function $R$, it almost surely converges to $|D|$. The first term is a constant times $\frac1k$, so it converges to $0$, and $\frac{k-m}k$ converges to $1$. Thus, $Y_k$ itself almost surely converges to $|D|$.