Improving the Edit Distance Algorithm

1.5k Views Asked by At

I applied an Edit Distance Algorithm for similarity between two strings over the lowercase latin alphabet, where the first string has length $m$ and the second length $n$.

However I want to improve it so that i get $O(n\log(n))$ solution or something less than $O(mn)$. My string length can be $100000$.

Any suggestion or method ?

#include<stdio.h>
#include<string.h>
int minm(int x,int y,int z){
int t= x<y?x:y;
 return t<z?t:z;


 }

int edit(int x,int y,int b,char s1[],char s2[]){

    if(s1[x]==s2[y])
        return 0;
    else
        return b;


}

int main(){

   int t,i,j,m,n,b,k,a;
   char s1[100000],s2[100000];
   scanf("%d",&t);
   while(t--){
         scanf("%s%s",s1,s2);
         scanf("%d%d%d",&a,&b,&k);
          m=strlen(s1);
          n=strlen(s2);
          int ar[m+1][n+1];
          for ( i = 0; i <=n; ++i)
          {
            /* code */
            ar[0][i]=i;
          }

          for ( i = 0; i <=m; ++i)
          {

            ar[i][0]=i;
          }


         for ( i = 1; i <=m ; ++i)
         {
            for ( j = 1; j <=n; ++j)
            {
                ar[i][j]=minm(ar[i][j-1]+a,ar[i-1][j]+a,ar[i-1][j-1]+edit(i-1,j-1,b,s1,s2));
            }

         }

       if(ar[m][n]<=k)
          printf("%d\n",ar[m][n]);
         else
            printf("-1\n");



   }






return 0;
}

Cost of Addition and Removal of letters is $a$ and cost of replacement is $b$. Optimal cost is $K$. Construed as a decision problem, if least cost is greater than $K$, I print -1 (i.e cost $K$ or less is infeasible).

Here my 2d array is quite expensive for string length of $100000$. $O(mn)$ is pathetic for $n=m\ge 100000$. So I need to optimise it so that I could reduce nested loops in some order of $\le 10^6$.

1

There are 1 best solutions below

4
On BEST ANSWER

As far as I know, even for the standard Levenshtein distance (where $a = b = 1$), there is no known algorithm to do this faster than $O(nm)$. See here.

Thus, instead of trying to optimize it better than $O(nm)$, I would optimize it based on the value of $K$. I am pretty sure $K$ must be very small in your case.