Inverse of percentage compounding while knowing only the percentage

23 Views Asked by At

Forewarning: This might overlap a bit with programming, but I felt that overall it fit better here than on StackOverflow. If I am incorrect I can move or cross-post this as required.


Context:

I'm writing software that takes a random sampling of a list of unknown length. The sampling is thus done by a percentage chance of selecting any given item. This sampling is used (after some other processing) to produce another list, on which the action repeats. The contents and length of the new list are based on the sampled portion of the old list. This repeats several times - up to 5, on some branches of the tree.

Unfortunately, this results in significant compounding - a few iterations in, a list is returned empty and later steps cannot complete. I would like to avoid this, in a way other than by setting the initial percentage very high.

Due to memory constraints, I cannot have an entire input list at one time (computed lazily from results of previous list-sampling), and so cannot know its length while sampling from it. I can know the length of a list after I have finished sampling from it.


Example:

Without sampling, using the whole list:

List $L$ is length $l$. Processing $L$ produces list $M$, of length $m$.

With sampling at 5$\%$, current state:

Sampling 5$\%$ of list $L$ produces list $L_5$, of average length $l*0.05$. Processing $L_5$ to get intermediate list $M'$ and sampling 5$\%$ of $M'$ produces $M_5$, of average length $m*0.05^2$ (less consistent average than for $L_5$'s length, depending on which elements were selected from $L$).

I need to know what percentage of $M'$ to sample in order for $M_5$ to be, on average, length $m*0.05$, without knowing the length of $M$ or $M'$ until after the sampling is done. The length of $L_5$ can be known. The length of $L$ cannot be known but may be estimated based on the length of $L_5$. Sampling all of $M$ and then dropping most of the results afterwards is not an option. This needs to work for arbitrary percentages (between 0 and 100), not just 5%.