How to estimate a vocabulary size?

I have a list of the 1 million most common English words, ordered by the number of times they appear in all books on Google Books. I want the user to select, from a list of 100 words (a small sample), every word he knows.

I want to calculate how many words the user probably knows from the list, extrapolating from his answers in the small sample.

Supposing the sample is every 10th word from the 10th to the 1000th, what mathematical formula would give me the best result?

There is 1 answer below

Assume that the user is equally likely to know any word on the list, that is, the probability of knowing some word $w_{1}$ equals the probability of knowing some word $w_{2}$ for any two words $w_{1}$ and $w_{2}$, and that these events are independent. Then, if the user knows $a$ out of the $100$ sampled words, a reasonable estimate is that they know $\frac{a}{100}n$ words out of a list of length $n$.

$n=1000$ in your example.
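
As a minimal sketch of the estimate above, under the same equal-probability and independence assumptions, the extrapolation is a single proportion. The function name `estimate_known_words` and the figure of 37 known words are hypothetical, chosen only for illustration.

```python
def estimate_known_words(known_in_sample: int, sample_size: int, list_length: int) -> float:
    """Scale the fraction of known sample words up to the full list.

    Assumes every word is equally likely to be known and that knowing
    one word is independent of knowing another.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= known_in_sample <= sample_size:
        raise ValueError("known_in_sample must lie between 0 and sample_size")
    # (a / sample_size) * n, multiplying first to limit rounding error
    return known_in_sample * list_length / sample_size


# Hypothetical input: the user marks 37 of the 100 sampled words as known,
# and the sample was drawn from the first 1000 words (n = 1000).
print(estimate_known_words(37, 100, 1000))  # prints 370.0
```

With $a=37$ and $n=1000$ this gives $\frac{37}{100}\cdot 1000 = 370$ words.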

Also, since you state no prior beliefs about which words are more likely to be known, I don't see why you are sampling every 10th word to build your list of 100.