Convert string to another string over a smaller alphabet, and vice versa.

132 Views Asked by At

I'm trying to find the most suitable algorithm to convert a string $\alpha$ over the alphabet $\Sigma$ of size $| \Sigma | = n$ to string $\beta$ over the alphabet $\Omega$ of size $| \Omega | = n-1$, and vice versa. By suitable I mean the balance of algorithm speed, implementation complexity, and overhead of the string length $| \beta | - | \alpha |$.

I've found several algorithms, but maybe there is some unknown to me. Perhaps you know such algorithms. Here's what I found:

  1. Conventional byte (character) stuffing. Pretty simple, but worst-case overhead is 200%.
  2. Consistent overhead byte stuffing (COBS). Also quite easy to implement. Worst-case overhead is much better than conventional byte stuffing.
  3. Think of the input string as a long number of base $n$, convert this number to base $n-1$. Very complex and computational heavy as compared with previous methods, but worst-case overhead is better if I'm not mistaken.

Anything else?

2

There are 2 best solutions below

1
On

I'm not sure if this counts, but it's an idea you might be interested in. In real world applications, you can identify a string in the $\Omega$ alphabet that would never ever ever arise in a real message. Neither in the $\Omega$ alphabet, nor in the $\Sigma$ alphabet after the simple embedding from $\Omega$ to $\Sigma$. Then you can use that string to represent one letter from $\Sigma$.

Let's say $\omega$ is such a string, something like "fhqwhgads" in English, although you could get away with something shorter. For notation, let $$\Sigma=\{s_0,s_1,\ldots, s_{n-1}\}\qquad\Omega=\{w_1,w_2,\ldots,w_{n-1}\}$$

Then map $$\DeclareMathOperator{\strings}{strings}f:\strings(\Sigma)\to\strings(\Omega)\quad f(s_i)=w_i\text{ when }i>0\quad f(s_0)=\omega$$ The inverse map would first examine a string $\beta$ for instances of $\omega$ and convert them to $s_0$, and then convert remaining $w_i$ to $s_i$. This relies on you knowing that $\omega$ would never practically arise in a real message from either alphabet.

1
On

This is probably not what you are looking into (this a math site, not a programming site), but anyway: a simple conventional byte stuffing algorithm with some input shifting (each time we need to "escape", we change the offset, so as to miminize the probability of worst-case scenarios). Overhead (average) should be below 1%. Tested.

Encoder:

/********************  encode. c **************************/
#include<stdio.h>

#define N 255                   //restricted alphabet 
#define NP1 (N+1)

#define DISALLOWED_CHAR 0 // only restrction: these three must be different 
#define ESCAPE_CHAR     1
#define ESCAPE_CHAR2    2

#define STEP            7 // any number, preferably coprime with NP1

int offset = 0;

void txraw(int c) {
    putchar(c);
}

void tx(int c) {
    c = (c + offset) % NP1;
    if (c == DISALLOWED_CHAR || c == ESCAPE_CHAR) {
    txraw(ESCAPE_CHAR);
    txraw(c == ESCAPE_CHAR ? ESCAPE_CHAR : ESCAPE_CHAR2);
    offset = (offset + STEP) % NP1;
    } else {
    txraw(c);
    }
}

int main(void) {
    int c;
    while ((c = getchar()) >= 0) {
    tx(c);
    }
    return 0;
}

Decoder:

/********************  decode. c **************************/
#include<stdio.h>

#define N 255                   //restricted alphabet 
#define NP1 (N+1)

#define DISALLOWED_CHAR 0 // only restrction: these must be different 
#define ESCAPE_CHAR     1
#define ESCAPE_CHAR2    2

#define STEP            7

int offset = 0;
int pendingEsc = 0;

int decode(int c) { // -2 if more to read
    int oo = offset;
    if (pendingEsc) {
        c = c  != ESCAPE_CHAR ? DISALLOWED_CHAR : ESCAPE_CHAR;
        pendingEsc = 0;
        offset = (offset + STEP) % NP1;
    } else {
        if (c == ESCAPE_CHAR) {
            pendingEsc = 1;
            return -2;
        }
    }
    return (c - oo) % NP1;
}

int readx() { // -2 if more to read
    int c = getchar();
    if (c == -1)       return c;
    else        return decode(c);
}

int main(void) { // decodes stdin -> stdout
    int c;
    while (1) {
        c = readx();
        if (c >= 0)
            putchar(c);
        else if (c == -1)
            break;
    }
    return 0;
}