How to capture "order" of a sequence?

I am trying to solve something akin to a "featurization" problem. This is the problem:

Let $S$ be a string of characters, so $S$ can be $AAABBB$, $AABBAB$, etc. If I want to featurize a string $S$, the length of the string $N$ and its composition entropy $H$ are good starting points. However, this map is clearly not bijective: many different strings share the same $N$ and $H$. So I added the histograms of $A$ and $B$ chunks to the feature space. These histograms tell me how many $n$-unit chunks of $A$ and of $B$ exist in my string. Given these histograms, I was wondering how I can put the chunks back together to make my string. Evidently, the chunks can be arranged in different ways to give different strings. Furthermore, I know there is a finite number of such strings: ${p+q \choose p}/2$ (divided by 2 because a chain whose order is reversed is still the same chain: $ABABB = BBABA$).

For example, $AAABBB$ has $N=6$ and $H = 1$. $\mathcal{H}_A$, the histogram of $A$ chunks, has $1$ chunk of length 3, and $\mathcal{H}_B$, the histogram of $B$ chunks, likewise has $1$ chunk of length 3. If I were given only these features ($N$ and the histograms), I could uniquely recreate the sequence $AAABBB$ (up to reversal).
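To make the features concrete, here is a minimal Python sketch of how they could be computed. The function names (`runs`, `composition_entropy`, `chunk_histograms`, `featurize`) are my own illustrative choices, and I am assuming that $H$ is the Shannon entropy in bits of the character frequencies.

```python
from collections import Counter
from itertools import groupby
from math import log2


def runs(s):
    """Split s into maximal chunks of one repeated character, in order."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def composition_entropy(s):
    """Shannon entropy (in bits) of the character frequencies of s (assumed definition of H)."""
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())


def chunk_histograms(s):
    """Map each character to a Counter {chunk length: number of chunks}."""
    hist = {}
    for ch, length in runs(s):
        hist.setdefault(ch, Counter())[length] += 1
    return hist


def featurize(s):
    """Return the feature tuple (N, H, chunk histograms) for s."""
    return len(s), composition_entropy(s), chunk_histograms(s)


N, H, hist = featurize("AAABBB")
print(N, H, {ch: dict(h) for ch, h in hist.items()})
```

Note that `runs` keeps the chunks in order, while `chunk_histograms` discards that order; this is exactly the information the histograms lose.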

However, this process breaks down when I have something like $S'=BBAAABBAABBAAAAAABB$. Because I have no notion of order in the histograms, I cannot uniquely recreate $S'$.
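The ambiguity can be checked directly. The sketch below (again with a hypothetical helper `chunk_histograms`, and a second string `s2` that I constructed by swapping two $A$ chunks of $S'$) compares two strings that agree on length, composition, and chunk histograms, yet are neither equal nor reversals of each other.

```python
from collections import Counter
from itertools import groupby


def chunk_histograms(s):
    """Map each character to a Counter {chunk length: number of chunks}."""
    hist = {}
    for ch, group in groupby(s):
        hist.setdefault(ch, Counter())[len(list(group))] += 1
    return hist


s1 = "BBAAABBAABBAAAAAABB"  # S' from the question: A chunks 3, 2, 6
s2 = "BBAABBAAABBAAAAAABB"  # illustrative variant: A chunks 2, 3, 6

# Same length, same composition (hence same entropy), same chunk histograms.
same_features = (
    len(s1) == len(s2)
    and Counter(s1) == Counter(s2)
    and chunk_histograms(s1) == chunk_histograms(s2)
)

# Different chains, even when a reversed chain counts as the same chain.
distinct_chains = s2 not in (s1, s1[::-1])

print(same_features, distinct_chains)
```

Both flags come out true, so $N$, $H$, and the histograms together cannot distinguish these two chains.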

My question is this: what additional input is required to capture the order in which the sequence is made? Is this even possible?

ADDENDUM: The problem I am thinking about is taking derivatives with respect to "sequence", in order to perform a search in sequence space. I have a generative function $F$ that I am trying to optimize with respect to the sequence, with $F(S) = f$. To optimize $f$, I would need to take gradients "with respect to sequences". So, to search in sequence space, I need derivatives satisfying $$\frac{\partial F(\{t\})}{ \partial t_i} = \frac{dF}{dS}\frac{\partial S}{\partial t_i} = 0$$