I'm trying to sample from an infinite stream of data without having to store all past items somewhere.
I immediately stumbled upon reservoir sampling, and from what I understand, it consists of maintaining a reservoir of $k$ items and progressively replacing items in the reservoir with items coming from the stream, in such a way that the reservoir is a true sample of the distribution at the end of the execution.
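To make sure we are talking about the same thing, here is a minimal sketch of the classic variant I have in mind (Vitter's Algorithm R); the function name and signature are my own:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Algorithm R: keep a uniform random sample of k items from a stream."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k items fill the reservoir.
            reservoir.append(item)
        else:
            # Item n is admitted with probability k/n, replacing a
            # uniformly chosen reservoir slot.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir
```

At any point during the stream, each item seen so far is in the reservoir with probability $k/n$.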
So, assuming $n$ steps have passed and that the items of the stream are drawn from a distribution $D$, my understanding is that:
- each sampled $x_i \sim D$, so if I stop at any point, I am guaranteed to have $k$ items sampled from $D$ in the reservoir
- the probability of replacing an item of the reservoir with a new item decreases with the running time $n$: the $n$-th item is admitted with probability $k/n$, i.e. $O(1/n)$
This second point is what bothers me: I would like all items to remain in the reservoir for the same amount of time in expectation. In other words, I would like $\forall i \in \mathbb N,\ \mathbb E[t_i]=c$, where $t_i$ is the number of steps item $i$ spends in the reservoir and $c$ is a constant. The reason is that I want a machine learning algorithm to learn from the items in the reservoir, and I want to avoid over-fitting on items that remain in the reservoir longer. I am not sure whether this is already a property of reservoir algorithms, but intuitively I would say it is not: at the start of the algorithm many more replacements happen, which should lower the expected residence time of early items compared to items admitted later.
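To check this intuition, here is a small simulation I would run (the bookkeeping scheme and names are my own): it runs Algorithm R over a finite stream and records, for every item that ever entered the reservoir, how many steps it stayed, so that mean residence time can be compared across entry times.

```python
import random

def residence_times(n_items, k, rng=random):
    """Run Algorithm R over a stream of n_items and record, for every
    item that ever entered the reservoir, how many steps it stayed."""
    reservoir = list(range(k))               # first k stream items fill the reservoir
    entered = {i: i + 1 for i in range(k)}   # item -> step at which it entered
    stay = {}                                # item -> steps spent in the reservoir
    for n in range(k + 1, n_items + 1):
        j = rng.randrange(n)
        if j < k:                            # item n-1 admitted with probability k/n
            evicted = reservoir[j]
            stay[evicted] = n - entered[evicted]
            reservoir[j] = n - 1
            entered[n - 1] = n
    for item in reservoir:                   # items still present at the end
        stay[item] = n_items - entered[item] + 1
    return entered, stay

# Compare mean residence time of the earliest vs latest decile of entries:
rng = random.Random(0)
entered, stay = residence_times(100_000, 100, rng)
early = [stay[i] for i, t in entered.items() if t <= 10_000]
late = [stay[i] for i, t in entered.items() if t > 90_000]
print(sum(early) / len(early), sum(late) / len(late))
```

Comparing the two printed means shows whether items admitted early do stay a shorter time on average, though I may well be mis-measuring something here.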
So I have some questions:
- is it possible to design a sampling algorithm that provides both equal sampling probability and equal expected time in the reservoir? Or are the two objectives fundamentally incompatible?
- If yes, what is an algorithm that meets these requirements? Is the constant $c$ a hyperparameter of the algorithm?
- If no, what would be an algorithm that fulfils the second requirement while staying "close" in spirit to the first?