Expected value of #correct tokens when generating a sequence

29 Views Asked by At

Let's say I've been given a sentence $\mathcal{S}$of $n$ words. I have a vocabulary $\mathcal M $ of $m$ words. If I sample $n$ words by picking at random from $\mathcal {M} $ successively what's the expected number of places where the generated words agree with those from $\mathcal{S}$?

  1. If I sample with replacement
  2. If I sample without replacement
2

There are 2 best solutions below

0
On

I've come up with the following expressions. Wanted to check if they are correct.

\begin{gather} \sum_{\ell=1}^n \ell \binom n\ell \left(\frac 1m\right)^\ell \left( \frac{m-1}m \right)^{n-\ell} \tag{with replacement} \\ \sum_{\ell=1}^n \ell \binom n\ell \prod_{j=0}^{\ell-1} \frac{1}{m-j} \prod_{i=0}^{n-\ell-1} \frac{-1+m-\ell-i}{m-\ell-i} \tag{without replacement} \end{gather}

10
On

Each word has a probability of $\frac 1m$ of matching. Hence the expected number of matches is $\frac nm$. As expectation is linear (with no assumption on independence), this is the answer regardless of whether you replace or not.

Just as an illustration: Suppose $m=2=n$. Let's suppose the words are $A,B$ and that the given sentence $S$ is $AB$.

With replacement: the random sentence can be $\{AA,AB,BA,BB\}$ each with probability $\frac 14$. The respective match scores are $\{1, 2,0,1\}$ so the expected number of matches is $\frac 14\times \left(1+2+0+1\right)=1$.

Without replacement: the random sentence can be $\{AB,BA\}$ each with probability $\frac 12$. The respective match scores are $\{2,0\}$ so the expected number of matches is $\frac 12\times \left(2+0\right)=1$.