Counting the number of identical sub-sequences in a larger sequence.

54 Views Asked by At

Suppose I am working with sequences like this (monotonic but not strictly monotonic where each member of the alphabet repeats an identical number of times):

$x = \{1,1,1,1,2,2,2,2,3,3,3,3\}$, then $|x| = N = 12$ and the alphabet is $\Sigma = \{1,2,3\}$

Ultimately I am interested in the contiguous identical sub-sequences in $x$ and a corrupted version of $x$ let's call it $x'$, and let's say it has a value:

$x' = \{1,1,2,2,2,3,3,3,3\}$.

With this I am hoping to measure the goodness-of-fit between the corrupted version and the reference version $x$ by looking at the distribution over the identical sub-sequences, instead of raw counts of the alphabet, since the sub-sequences carry information on the monotonic nature of both sequences. That is my intution at least (looking at the distribution of the alphabet would simply yield a uniform distribution, and so would loose the monotonic repeating trend).

As an example, suppose we simply just count the identical $n-$tuples (let's call these $a_n$) when $n=2$, then we simply count all the identical tuples which:

  1. are exactly of length 2 i.e. $|a_2| = 2$.
  2. have all symbols in $a_2$ identical i.e. $a[0]=a[1]$.

There are three 2-tuples (i.e. three 2-tuples which look like so: $[1,1]$) in the first sub-sequence of ones, three 2-tuples in the second sub-sequence of twos and the same for the 2-tuples in the last sub-sequence of threes. In total there are 9 2-tuples in $x$.

Thus the probability of finding an identical 2-tuple in $x$ is:

$$P[a_2] = \frac{1}{9}$$

(not entirely sure this is right, I am primarily interested in counting the number of identical $n-$tuples).

Hence, now I would like to extend this so that I get a PMF for all the available tuples i.e. for $n \in \{2,3,4\}$ as $4$ is the maximum identical tuple we can consider given what $x$ looks like.

The histogram for this problem:

$n=2 : \#9$ $n=3 : \#6$ $n=4 : \#3$

Possible solution:

$f$ : identical sub-sequence count

$f(n) \triangleq N - (n-1)|\Sigma|$

Is it as simple as this? Though of course, if you construct the PMF from this, it does not sum to 1.