Suppose I have a set of alphabet $\Sigma = \{ A,B,\ldots,Z \}$ with $|\Sigma|=26$. I am generating a set of $n$ random strings $S$, with each string of length $k$ from the alphabet (for the random string, each letter in a position is picked with probability $p=\frac{1}{|\Sigma|}$).
I want to find out:
- the expected length of longest common prefix
- the maximum length of longest common prefix with high probability.
I am working on a radix-tree based algorithm and want to justify the expected length of search operation theoretically.
Suppose we are selecting $n$ strings of length $k$ over an alphabet of $m$ letters. Classifying by the length $q$ of the longest common prefix we obtain for the total count of strings
$$m^k + \sum_{q=0}^{k-1} m^q (m^n - m) m^{n(k-1-q)}.$$
This simplifies to
$$m^k + (m^n - m) m^{n(k-1)} \sum_{q=0}^{k-1} m^q m^{-nq} \\ = m^k + (m^n - m) m^{n(k-1)} \sum_{q=0}^{k-1} m^{q(1-n)} \\ = m^k + (m^n - m) m^{n(k-1)} \frac{m^{(1-n)k}-1}{m^{1-n}-1} \\ = m^k + (m^n - m) m^{n(k-1)} \frac{m^{k-(k-1)n}-m^n}{m-m^n} \\ = m^k - m^{n(k-1)} \times (m^{k-(k-1)n}-m^n) \\ = m^k - m^k + m^{nk} = m^{nk}.$$
We have all of them and the sanity check goes through. The trick here was that we can view these $n$ strings as a rectangular array with $n$ rows and $k$ columns. Supposing that the first $q$ columns are the same the next column must not consist of $n$ identical letters or else the common prefix would extend to include it. That means we have $m^n-m$ choices for the letters in said column (there are $m$ strings of length $n$ containing a single letter.)
Now for the expectation we get
$$k m^k + \sum_{q=0}^{k-1} q m^q (m^n - m) m^{n(k-1-q)} \\ = k m^k + (m^n - m) m^{n(k-1)} \sum_{q=0}^{k-1} q m^q m^{-nq} \\ = k m^k + (m^n - m) m^{n(k-1)} \sum_{q=0}^{k-1} q m^{q(1-n)}.$$
Now we have
$$x\left(\sum_{p=0}^{k-1} x^p\right)' = \sum_{p=0}^{k-1} p x^p = x \left(\frac{x^k-1}{x-1}\right)' = x \frac{k x^{k-1}}{(x-1)} - x \frac{x^k-1}{(x-1)^2}.$$
Apply this to the sum to get two terms, the first of which is
$$k m^k + (m^n - m) m^{n(k-1)} m^{1-n} \frac{k m^{(1-n)(k-1)}}{m^{1-n}-1} \\ = k m^k + (m^n - m) m^{n(k-1)} m \frac{k m^{(1-n)(k-1)}}{m-m^n} \\ = k m^k - m^{n(k-1)} m k m^{(1-n)(k-1)} = 0.$$
The second is
$$- (m^n - m) m^{n(k-1)} m^{1-n} \frac{m^{(1-n)k}-1}{(m^{1-n}-1)^2} \\ = - (m^n - m) m^{n(k-1)} m^{n+1} \frac{m^{(1-n)k}-1}{(m-m^n)^2} \\ = m^{n(k-1)} m^{n+1} \frac{m^{(1-n)k}-1}{m-m^n} \\ = m^{nk} \frac{1-m^{(1-n)k}}{m^{n-1}-1}$$
Divide by $m^{nk}$ to get for the expected longest common prefix
$$\frac{1-m^{(1-n)k}}{m^{n-1}-1}.$$
The following Maple code was used to verify this formula.
Q := proc(n, k, m) option remember; local d, ind, pref, pos, res, idx1, idx2; res := 0; for ind from m^(n*k) to 2*m^(n*k)-1 do d := convert(ind, base, m); for pref from 0 to k-1 do for pos from 1 to n-1 do idx1 := (pos-1)*k + pref; idx2 := pos*k + pref; if d[idx1+1] <> d[idx2+1] then break; fi; od; if pos < n then break; fi; od; res := res + pref; od; res/m^(n*k); end; X := (n, k, m) -> (1-m^((1-n)*k))/(m^(n-1)-1);