The other night I was hanging out with some friends and someone put on a playlist on shuffle random, where the songs are drawn uniformly at random from a fixed playlist. The person who put the playlist together forgot how many songs were in it, so the topic came up of how to estimate the size of the playlist purely based on what we hearing.
We came up with a few high-level ideas for how to do this. For example, using ideas from the Birthday Paradox, we thought we could listen until we heard a song repeated for the first time, then use that to make an educated guess about how many songs were on the playlist in total. We also thought that we could listen for a long time and build up a frequency histogram of the number of times each song was played, then use the fact that it should look somewhat normally distributed to get the mean and variance and from there estimate the total number of songs on the list.
None of us are statisticians or have a lot of training in machine learning, but I suspect that this is probably a well-studied problem and that there are some really nice techniques we can use to estimate the playlist size.
Are there a good family of techniques for estimating the playlist size? From a practical perspective, would any of these techniques be something that would be relatively easy to work out without a computer or calculator?
Thanks!
From the Langford&Langford paper, suppose you have heard $m$ songs, of which $i$ were different, and $i<m$. You wish to find the size of the playlist, whose most likely value (in the maximum likelihood sense) is $\hat{N}$, which is given by $$\hat{N}=\left\lfloor \frac{1}{1-y^\star} \right\rfloor$$ where $\lfloor \cdot \rfloor$ denotes the floor function, and $y^\star$ denotes the smaller positive root of the polynomial $$y^m-iy+(i-1)=0$$
Note: if $i=m$, then all the songs you've heard so far have been distinct. Then there is no good way to estimate the size of the playlist; it might be infinite for all we know.
For convenience, here is a table of values for $\hat{N}$, all for $m=10$. For example, if you listen to $m=10$ songs and hear $i=7$ different ones among them, then the most likely size of the playlist is $\hat{N}=12$.
\begin{array} {|r|r|} \hline i & \hat{N}\\ \hline 1 &1 \\ 2&2\\ 3&3\\ 4&4\\ 5&5\\ 6&8\\ 7&12\\ 8&19\\ 9&42\\ \hline \end{array}