Optimal Edition from string $X$ to string $Y$

55 Views Asked by At

$\newcommand{\restrict}[2]{{#1}\restriction_{#2}}$ $\newcommand{\cardinal}[1]{\abs{#1}}$ $\newcommand{\abs}[1]{\left\lvert #1 \right\rvert}$ $\newcommand{\append}[2]{\operatorname{ap}\left(#1,#2\right)}$ $\newcommand{\domain}[1]{\operatorname{dom}\left(#1\right)}$ $\newcommand{\range}[1]{\operatorname{range}\left(#1\right)}$

I have been trying to figure out a proof for Levenshtein distance for a while. Essentially, this algorithm establishes a shortest path of edition from a string $X$ to string $Y$ using dynamic programming.

An edition can be $I_{n, v}$ (insertion of a character $v$ at the $n$-th place of a string), $D_{n}$ (deletion of the character at the $n$-th place of a string) or $S_{n, v}$ (substitution of the character at the $n$-th place of a string by $v$). A path of edition is a composition of such operations. For convenience, I take a path of edition as a sequence of tuples: $(I, n, v)$ (insertion), $\left(D, n\right)$ (deletion) and $\left(S, n, v\right)$ (substitution). Further, instead of writing $P\left(S\right)$ to denote application of path $P$ to sequence $S$, I write $a\left(P, S\right)$. Proof of Levenshtein distance can be reduced to proving the following lemma:

Let $m, n \in \mathbb{N}^{+}$ and $X$ and $Y$ be two sequences such that $\cardinal{X} = m$ and $\cardinal{Y} = n$. Let $P_{1}, P_{2}, P_{3}$ be shortest modification paths such that \begin{equation*} a\left(P_{1}, \restrict{X}{\left[1, m - 1\right]}\right) = \restrict{Y}{\left[1, n - 1\right]}, \end{equation*} \begin{equation*} a\left(P_{2}, \restrict{X}{\left[1,m - 1\right]}\right) = Y, \end{equation*} and \begin{equation*} a\left(P_{3}, X\right) = \restrict{Y}{\left[1, n - 1\right]}. \end{equation*} If $X\left(m\right) = Y\left(n\right)$, then a shortest modification path from $X$ to $Y$ is among the following three selections: $P_{1}$, $\append{P_{2}}{\left(D, n + 1\right)}$ and $\append{P_{3}}{\left(I, n, Y\left(n\right)\right)}$. If $X\left(m\right) \neq Y\left(n\right)$, then the shortest modification path from $X$ to $Y$ is among the following three selections: $\append{P_{1}}{\left(S, n, Y\left(n\right)\right)}$, $\append{P_{2}}{\left(D, n + 1\right)}$ and $\append{P_{3}}{\left(I, n, Y\left(n\right)\right)}$.

In the lemma, $\cardinal{\cdot}$ stands for cardinal (length), $\restrict{F}{S}$ means restriction of a function $F$ to a set $S$ and $\append{X}{Y}$ means "append $Y$ to $X$". To prove this lemma, I already proved the following lemma in advance from some hint I found:

Let $X$ and $Y$ be sequences of $V$ and $P$ is an optimal path of edition from $X$ to $Y$. Then for any $i \in \domain{P}$, there are the following possibilities:

(1) $P\left(i\right) = \left(D, k\right)$ for some $k$ such that $k \in \domain{a\left(\restrict{P}{\left[1,i - 1\right]}, X\right)}$ and $a\left(\restrict{P}{\left[1,i - 1\right]}, X\right)\left(k\right) = v$ for some $v \in \range{X}$ (deletion of a character in $X$);

(2) $P\left(i\right) = \left(I, k, v\right)$ for some $k$ such that $k \in \left[1, \cardinal{a\left(\restrict{P}{\left[1,i - 1\right]}, X\right)} + 1\right]$ and some $v \in \range{Y}$ (insertion of a character in $Y$);

(3) $P\left(i\right) = \left(S, k, v\right)$ for some $k$ such that $k \in \domain{a\left(\restrict{P}{\left[1,i - 1\right]}, X\right)}$ and $a\left(\restrict{P}{\left[1,i - 1\right]}, X\right)\left(k\right) \in \range{X}$ and $v \in \range{Y}$ for some $v \in Y$ (substitution of a character in $X$ by a character in $Y$).

The proof for the above lemma is very tedious and involves many technical details, and I do not wish to present it here. Anyone who is interested may ask for it, and I can forward him/her the latex source.

It is my understanding that induction has to be used to prove the first lemma I wish to prove. The case for $n = 0$ is trivial. I have also tested $n = 1$. However, I do not know how to proceed from $n$ to $n + 1$ as $n \geq 1$, $n$ being the length of $Y$. Can anyone provide thoughts or references?

1

There are 1 best solutions below

0
On

The best result along this direction is Garrett Wright's algorithm - take a look.

Improved Time Warp Edit Distance--A Parallel Dynamic Program in Linear Memory G Wright arXiv preprint arXiv:2007.16135