Let $k , n , s \in \mathbb{N}$. Let $X$ be a string of length $n$ selected uniformly at random from an alphabet of size $s$. What is the probability that $X$ contains a repeated substring of length $k$?
Below are some technical notes (feel free to disregard them) indicating my attempts to solve this problem that are inadequate:
- If substrings were nonoverlapping blocks starting and ending at indices that are multiples of $k$, then each substring would be independent, and this would reduce to the birthday problem. But overlapping substrings are dependent, so it's a harder problem.
- Mitzenmacher and Upfal (section 12.5.2) analyze a similar problem: given a fixed string $b$ of length $k$, upper bound the probability that its number of occurrences in $X$ is far from its expected value of $(n-k+1) / s^k$. They use martingales with the Azuma-Hoeffding inequality to overcome the fact that substring occurrences are dependent, but I don't see how to apply their technique to this problem.
- Trying to apply their bound directly to this problem doesn't appear to work very well: Let $\#(b,X)$ denote the random variable representing the number of occurrences of $b$ in $X$, so $\mathrm{E}[\#(b,X)] = (n-k+1)/s^k$. They show that for each $\epsilon$, $\Pr[\#(b,X) - \mathrm{E}[\#(b,X)] \geq \epsilon] \leq e^{-2 \epsilon^2/(nk^2)}$. But even if $\mathrm{E}[\#(b,X)]$ is very close to 0, if we choose $\epsilon = 2$ to bound the probability of $b$ occurring twice, we get $\Pr[\#(b,X) \geq 2] \leq e^{-8/(nk^2)}$, which is useless for almost all values of $n$ and $k$.
- Even if the previous bound $e^{-8/(nk^2)}$ were smaller, to use it to solve the original problem, one must then apply the union bound over all $s^k$ strings $b$ of length $k$ to get $\Pr[(\exists b \in \Sigma^k)\ \#(b,X) \geq 2] \leq s^k e^{-8/(nk^2)}$, but this is too weak a technique , in the same way that one would not solve the birthday problem by analyzing, for each fixed calendar date, the probability that it is picked twice.