What is the variance of the number of occurrences of a subsequence in a random sequence.

41 Views Asked by At

Let $N_n$ be a random string of length $n$, where each of the $n$ characters in $S_n$ is independently chosen with uniform probability from the set $\mathcal{S} := \{s_1, \ldots, s_K\}$. Here $K$ is FIXED. Now consider there is a fixed string of length $m = m(n)$ also from $\mathcal{S}$, we denote it as $N_m^{\star}$. Now we want to know the variance of the number of occurrences of $N_m^{\star}$ appearing in $N_n$. Here $n \rightarrow + \infty$.

The following is my try: Letting $X$ be the number of occurrences of $N_m^{\star}$, we can write $$ X =\sum_{i=1}^{n - m - 1} X_i, $$ where $X_i$ is an indicator random variable, equal to one when the $i$-th subsequence is $N_m^{\star}$, and zero otherwise. Then $X_i$ is Bernoulli distributed with successful probability $K^{-m}$. Thus, we have $$ var(X_i) = K^{-m}(1 - K^{-m}), $$ and $cov(X_i, X_j) = 0$ if $|j - i| > m$ as $X_i$ is independent with $X_j$. But when $|i - j| \leq m$, I only obtain that $E [X_i X_j ] \leq K^{-(k + m - 2)}$ with $j = i + k$. Then $$ \begin{aligned} cov(X_i, X_j) & = E [X_i X_j ] - E X_i E X_j \\ & \leq K^{-(k + m - 2)} - K^{-2m} \\ & = K^{-2m} \big( K^{m + 2 - k} - 1\big) . \end{aligned} $$ Then I got that the variance can be bounded by $$ \begin{aligned} & var \big(X \big) \\ = & \sum_{i = 1}^{n - m + 1} var(X_i) + \sum_{1 \leq i \neq j \leq n - m + 1} cov(X_i, X_j) \\ \leq & (n - m + 1) K^{-m}(1 - K^{-m}) + \sum_{k = 1}^m \sum_{|j - i| = k} K^{-2m} \big( K^{m + 2 - k} - 1\big). \end{aligned} $$ But I find this bound is too loose. There is an alternative approach for bounding this variance more tightly?

1

There are 1 best solutions below

1
On BEST ANSWER

The variance is maximized when all possible covariance terms are maximized, which happens when $N_m^{\star}$ consists of the same character $s_{\star}$ repeated $m$ times.

In this case, for $1 \le k < m$, $X_i X_{i+k}$ is $1$ if we see $m+k$ consecutive instances of $s_{\star}$, so $\mathbb E[X_i X_{i+k}] = K^{-(m+k)}$ and $\operatorname{Cov}[X_i,X_{i+k}] = K^{-(m+k)} - K^{-2m}$. So the bound in the question is essentially correct, except I'm not sure where the extra factor of $K^2$ came from. We get $$\operatorname{Var}[X] = (n-m+1)(K^{-m}-K^{-2m}) + 2\sum_{k=1}^m (n-m-k+1) (K^{-(m+k)} - K^{-2m}).$$ The sum can be exactly computed, but the result is not very enlightening. A simple upper bound that essentially captures the truth is $(n-m+1) K^{-m} \cdot \frac{K+1}{K-1}$, which is to say: only a small constant factor greater than the sum of the variances of the $X_i$.