"How much of the big picture have I seen already?" problem


TL;DR / Theoretical formulation

Let $X_1, X_2, ..., X_k$ be a sequence of i.i.d. random variables with $X_i \sim \mathcal{U}\{1, 2, ..., n\}$ (discrete uniform distribution). The parameter $n$ is unknown. Let $U$ be the number of unique values seen in the sequence. Given $k$ and $U$, how to estimate $n$?


The problem

Let $A = \{a_1, a_2, ..., a_n\}$ be a set of "concepts" that you will have to learn, for example for a university course.

As you don't see the big picture of the topic yet, you don't have an idea of how big $n$ is, and you want to estimate $n$.

Over a few days, you speak with $k$ different people: each time it's a one-to-one discussion with a friend who studied the topic before and who is supposed to master all $n$ concepts perfectly. Each friend chooses a concept uniformly at random from the set $A$ and speaks to you about it.

At the end of the $k$ discussions, you write down $u$, the number of unique concepts seen.

Goal: how to estimate $n$, as a function of $k$ and $u$?

Example 1: if among $k=15$ discussions, one concept has been discussed twice (so there is 1 repetition), another concept has been discussed three times (2 repetitions), and all others just once, then the total number of repetitions is $3$ and the number of unique concepts discussed is $u=12$. You can imagine that if you proceeded with even more discussions, many new, unseen concepts would appear. Thus $n$ is probably large.

Example 2: if among $k=15$ discussions, one concept has been discussed $10$ times, another $3$ times, and two other concepts once each, then $u = 4$. Since many people spoke about the same concept (even though they chose it randomly!), it gives the impression that you have probably seen most of the whole subject. Thus $n$ is not so big.

How to estimate $n$?


Other applications

A) You are new in the city, and you don't know how big ($n$) it is. If each time you go to a bar you see new people you haven't seen before ($k \approx u$), then $n$ is probably big. If, on the other hand, you always see the same people again and again ($u$ much smaller than $k$), then $n$ is probably small.

B) You're writing a book and ask $k$ friends to proofread it. Each friend reports one mistake chosen at random among the mistakes they have seen. If the total number of unique mistakes reported is much smaller than $k$, i.e. there are many repetitions among the reported mistakes, you have probably covered most of the mistakes (assuming the distribution is uniform, etc.).
If, on the other hand, each proofreader reports a mistake that nobody else has reported, this is bad news for you: there are probably many more mistakes to find in the book!

Best answer

Here are two approaches (there are others).

A. The probability of seeing $u$ unique values when picking $k$ from $n$ with replacement is $$\dfrac{S_2(k,u)\, n!}{n^k(n-u)!}$$ where $S_2(k,u)$ is a Stirling number of the second kind. So you could estimate $n$ by taking the value which maximises this likelihood for given $k$ and $u$. Comparing the likelihoods at $n$ and $n-1$, this is the largest integer $n$ for which $n (n-1)^k \ge (n-u) n^k$.
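The inequality above can be checked with exact integer arithmetic, so method A needs no floating point at all. A minimal sketch (the function name `mle_n` is my own; it assumes $1 \le u < k$, since with $u = k$, i.e. no repetitions, the likelihood keeps increasing with $n$ and there is no finite maximiser):

```python
def mle_n(k, u):
    """Method A: maximum-likelihood estimate of n, i.e. the largest
    integer n with n*(n-1)**k >= (n-u)*n**k (exact integer arithmetic)."""
    if u == k:
        # all k draws unique: the likelihood grows with n, no finite MLE
        raise ValueError("no finite MLE when u == k")
    n = u
    # the inequality holds at n = u; advance while it still holds at n + 1
    while (n + 1) * n**k >= (n + 1 - u) * (n + 1)**k:
        n += 1
    return n
```

On the numerical example below ($k=10$, $u=6$) this search stops at $n=8$, matching the answer.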

B. The expected number of unique values seen when picking $k$ from $n$ with replacement is $$E[U] = n\left(1-\left(1- \frac1n\right)^k\right).$$ So you could estimate $n$ by taking the value for which this expectation equals the observed $u$, solving the equation numerically.
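Since $E[U]$ is increasing in $n$, the equation $n\left(1-(1-1/n)^k\right) = u$ can be solved by simple bisection over real $n \ge u$. A standard-library sketch (the function name `moment_n` is my own; as in method A, $u < k$ is required, because $E[U] < k$ for every finite $n$):

```python
def moment_n(k, u, tol=1e-9):
    """Method B: solve n*(1 - (1 - 1/n)**k) = u for real n >= u
    by bisection; f is increasing in n, so one sign change suffices."""
    if u == k:
        raise ValueError("no finite solution when u == k")
    f = lambda n: n * (1 - (1 - 1 / n) ** k) - u
    lo, hi = float(u), 2.0 * float(u)
    while f(hi) < 0:            # grow the upper bracket until f crosses zero
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For $k=10$, $u=6$ this returns roughly $8.2969$, in line with the numerical example below.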

As a numerical example, suppose you have $k=10$ samples and observe $u=6$ unique values. Then method A suggests an estimate of $n=8$, while method B suggests an estimate of about $n \approx 8.296887$.


Additional notes on Henry's answer (I had to write this out to understand part B):

Let $$\Omega = \{ (x_1, x_2, ..., x_k), \textrm { with } x_i \in [\![1,n]\!] \},$$ $$B_j= \{ (x_1, x_2, ..., x_k),\textrm{ with }x_i \in [\![1,n]\!] -\{j\} \}$$ and $$C_j= \{ (x_1, x_2, ..., x_k),\textrm{ with at least one }x_i = j \}.$$ Then $P(C_j)= 1 - P(B_j) = 1 - \frac{(n-1)^k}{n^k} = 1 - (1 - 1/n)^k$.

Now, given $(X_1, X_2, ..., X_k)$ i.i.d. random variables (each uniform on $[\![1,n]\!]$), let $U$ be the random variable counting the number of unique values among them.

With the notation $$Y_j = \left\{ \begin{array}{ll} 1 & \mbox{if } X_1 =j \mbox{ or }X_2 =j \mbox{ or ... or } X_k =j \\ 0 & \mbox{else} \end{array} \right.$$ we have: $U = \sum_{j=1}^n Y_j$ and:

$$E[U] = \sum_{j=1}^n E[Y_j] = \sum_{j=1}^n \left(0\cdot P(Y_j=0) + 1\cdot P(Y_j=1)\right) = \sum_{j=1}^n P(C_j)=n\left(1 - (1 - 1/n)^k\right).$$
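The closed form for $E[U]$ derived above is easy to sanity-check by simulation: draw $k$ uniform values from $\{1,\dots,n\}$ many times and average the number of uniques. A small sketch (function name and trial count are my own choices):

```python
import random

def simulate_E_U(n, k, trials=50_000, seed=0):
    """Monte Carlo estimate of E[U]: draw k values uniformly from
    {1..n} with replacement and count the distinct ones, averaged."""
    rng = random.Random(seed)
    total = sum(len({rng.randint(1, n) for _ in range(k)})
                for _ in range(trials))
    return total / trials

n, k = 20, 15
exact = n * (1 - (1 - 1 / n) ** k)   # the closed form derived above
print(exact, simulate_E_U(n, k))     # the two values should nearly agree
```

For $n=20$, $k=15$ the closed form gives about $10.73$, and the simulated average lands very close to it.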