Finding if two sequence M and N are in the shuffle product of W and Z while not knowing W and Z

127 Views Asked by At

I have to solve this problem using dynamic programming but I'm stuck.

We have to sequence W and Z. Let's assume for now that:

W = ACA and Z = GT their shuffle product is W ⧢ Z = {GTACA, GATCA, GACTA, GACAT , AGTCA, AGCTA,AGCAT , ACGTA, ACGAT , ACAGT}

Now the actual problem is that if I'm given sequence M and N (we know them) I have to determine if they are in the shuffle product of W and Z (Which we now consider unknown). Also if they are indeed in their shuffle product my algorithm must find W and Z.

I don't know where to start. How can I even do this while not knowing W and Z?

There is also this additional information.

W is of length k

Z is of length l

M is of length k+l = n

N is of length n. (Both M and N have the same length)

Even with this I don't know where to start.

1

There are 1 best solutions below

0
On

There may be a more efficient way to do this, but here’s one way.

Go through $M$ and $N$ in parallel from beginning to end and in each step keep track of all pairs of initial segments of $W$ and $Z$ that would be compatible with what you've seen, and of the corresponding positions up to which $M$ and $N$ have followed $W$ and $Z$.

For instance, if $M$ is GTACAand $N$ is GATCA in your example, you’d have the following six active hypotheses after processing the first two letters:

\begin{array}{c|c} W&Z&M\text{ in }W&M\text{ in }Z&N\text{ in }W&N\text{ in }Z\\\hline \text{GT}&\text{GA}&2&0&0&2\\ \text{GA}&\text{GT}&0&2&2&0\\ \text{GA}&\text{T}&1&1&2&0\\ \text{T}&\text{GA}&1&1&0&2\\ \text{GT}&\text{A}&2&0&1&1\\ \text{A}&\text{GT}&0&2&1&1\\ \end{array}

The hypotheses that are active when you reach the end are the solutions.

Here’s Java code that implements this algorithm and applies it to $M$ and $N$ randomly shuffled from randomly generated sequences $W$ and $Z$. If you only have $4$ letters, there are lots of possibilities. I did a dozen runs with $k=20$ and $l=30$; the results varied from $10$ solution pairs to $360356$ solution pairs, with half of the runs producing more than $100000$ solution pairs and one run running out of memory because there were more than $10^8$ hypotheses to keep track of. Longer sequences (e.g. $k=30$ and $l=40$) produce too many hypotheses to handle; the number of active hypotheses grows roughly exponentially until you’re about halfway through. If you have more letters, things are a bit better, but not much. So this will only work if you have medium-sized sequences to compare, or if you find a way to optimize it to avoid the exponential growth. However, given that there are typically hundreds of thousands of solutions, it seems likely that any algorithm that finds all solutions will be expensive.

If, on the other hand, you’re content with finding one of the solutions, the fact that there are typically so many of them turns into an advantage. You could then use some standard combinatorial optimization approach (e.g. simulated annealing) to find one of them: Start with some random assignment of the positions in $M$ and $N$ to $W$ and $Z$ and use the number of mismatches as the objective function. With so many global optima to find, a good combinatorial optimization approach should find one of them efficiently.