devising a code for five symbols


Problem: You need to send messages in the five symbols A, B, C, D, E. The frequencies of each symbol are in this order from most frequent to least frequent. Devise a code that will make the average length of messages as short as possible.

I know that A should have the shortest code, B the next shortest and so on with E having the longest code.

But I don't know what other constraints there are. For example, how do I make sure that the code is decodable?

In the same chapter, an example for encoding the genetic code was given, with the assignment as follows:

A 0

C 10

T 110

G 111

(order of frequency is A, C, T, G for genes) Is there a way to turn this into a 5-symbol code?


There are 2 best solutions below

4
On

You need to use what is known as a prefix code, meaning that no codeword is a prefix of another codeword. For example, $\{1,61,766,7716\}$ is a valid prefix code because no codeword is a prefix of another. An example of a set that is not a prefix code is $\{1,61,617,1776\}$, because $1$ is a prefix of $1776$, and $61$ is a prefix of $617$.
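The prefix condition is easy to check mechanically. Here is a minimal sketch in Python; the function name `is_prefix_free` is just an illustrative choice, and codewords are assumed to be given as strings.

```python
from itertools import permutations


def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of a different codeword."""
    # Compare every ordered pair (a, b) taken from different positions.
    return not any(b.startswith(a) for a, b in permutations(codewords, 2))


valid = is_prefix_free(["1", "61", "766", "7716"])
invalid = is_prefix_free(["1", "61", "617", "1776"])
print(valid, invalid)
```

The pairwise check is quadratic in the number of codewords, which is fine for a handful of symbols.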

One simple solution to your problem is to do the following:

$$\begin{array}{cc}\text{Symbol}&\text{Code} \\ A & 0 \\ B & 10 \\ C & 110 \\ D & 1110 \\ E & 11110 \end{array}$$

This is a valid prefix code. The more frequently a character is used, the fewer bits it uses. (Strictly, the trailing $0$ of E is not needed: E could be $1111$ and the code would still be a prefix code.) But if you wanted to devise a code tuned to the actual frequencies (for example, A has a frequency "score" of $8$, B has a "score" of $6$, C has a "score" of $5$, etc.), a Huffman code can be built using a binary tree.
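As a rough illustration of the Huffman construction, here is a sketch in Python. The scores for A, B and C are the ones mentioned above; the scores for D and E (3 and 2) are made up for the example, since none were given. The function name `huffman_code` is also just illustrative.

```python
import heapq
from itertools import count


def huffman_code(weights):
    """Build a binary Huffman code for a dict {symbol: weight}."""
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least heavy subtrees, prefixing their codewords.
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, next(tiebreak), merged))
    return heap[0][2]


# A, B, C from the text; D and E are assumed values for illustration.
scores = {"A": 8, "B": 6, "C": 5, "D": 3, "E": 2}
code = huffman_code(scores)
total_bits = sum(scores[s] * len(code[s]) for s in scores)
print(code, total_bits)
```

With these particular scores the Huffman code gives A, B and C two bits each and D and E three bits each, for 53 bits per 24 symbols. The code 0, 10, 110, 1110, 1111 would need 55 bits for the same 24 symbols, so the simple scheme is not always the best one.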

0
On

The assumption that needs to be made to use a coding like the one applied to the nucleotide bases is not only that the probabilities are decreasing from A to E, but that they are decreasing a lot. Otherwise, you need to consider the probabilities explicitly. I'll call A, B, C, ... just $X_1, X_2, X_3, \ldots$ for simplicity.

But if $p(X_i) \gg p(X_{i+1})$, then it is always beneficial to give the more probable symbol the shorter codeword where possible. For example, for $A,B,C,D$ we could choose 00, 01, 10, 11 (this is sometimes efficient, and is the usual choice with equal probabilities), but if $A$ were probable enough, it would be advantageous to use a code in which $A$ takes just one digit, even if the others must then be longer; for example, A=0, B=100, C=110, D=111 would be better (and B=10 would be better still, since 100 leaves the string 101 unused). In the extreme, if $A$ were overwhelmingly probable ($p(A)\rightarrow 1$), the efficiency would be completely dominated by the number of digits used to represent $A$, regardless of the others.
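To make the comparison concrete, here is a small Python sketch that computes the average codeword length of the two four-symbol codes from the paragraph above. The probabilities are assumed values chosen only for illustration, and `average_length` is a hypothetical helper name.

```python
def average_length(probs, code):
    """Expected number of digits per symbol for a given code."""
    return sum(p * len(code[sym]) for sym, p in probs.items())


fixed = {"A": "00", "B": "01", "C": "10", "D": "11"}
skewed = {"A": "0", "B": "100", "C": "110", "D": "111"}

# Assumed distributions: one uniform, one where A dominates.
uniform = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
a_heavy = {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1}

for name, probs in [("uniform", uniform), ("A-heavy", a_heavy)]:
    print(name, average_length(probs, fixed), average_length(probs, skewed))
```

If the other three symbols share the remaining probability equally, the skewed code averages $3 - 2p(A)$ digits, so it beats the fixed two-digit code exactly when $p(A) > 1/2$.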

If you apply the idea that $X_i$ is much more probable than $X_{i+1}$ to all the symbols, then you get:

$\begin{array}{cc}\text{Symbol}&\text{Code} \\ A & 0 \\ B & 10 \\ C & 110 \\ D & 1110 \\ E & 1111 \end{array}$

Much like the DNA code. Think of each run of $1$s as representing a number by its length, with the $0$ signifying the end of the run. For the longest codeword, four $1$s on their own are enough to indicate the end of the number.
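The "count the $1$s until a $0$ or until four of them" reading translates directly into a decoder. This is a minimal Python sketch for the five-symbol table above; the function name and its parameters are illustrative choices.

```python
def decode(bits, symbols="ABCDE"):
    """Decode a string of '0'/'1' digits using the code 0, 10, 110, 1110, 1111."""
    max_ones = len(symbols) - 1  # four 1s in a row means the last symbol
    out = []
    ones = 0
    for bit in bits:
        if bit == "1":
            ones += 1
            if ones == max_ones:
                # The longest codeword needs no terminating 0.
                out.append(symbols[ones])
                ones = 0
        elif bit == "0":
            # A 0 ends the run; the run length selects the symbol.
            out.append(symbols[ones])
            ones = 0
        else:
            raise ValueError(f"unexpected digit: {bit!r}")
    if ones:
        raise ValueError("input ends in the middle of a codeword")
    return "".join(out)


message = decode("0" + "10" + "110" + "1110" + "1111")
print(message)
```

Because no codeword is a prefix of another, the decoder never needs to look ahead: each symbol is emitted as soon as its last digit arrives.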


Without thinking about it more, I'm not sure of the exact conditions for this to apply, though it is probably something like $p(X_i)/p(X_{i+1}) > 2$ (I haven't checked that this is correct at all!).
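A sketch of how one might check a condition of this kind, writing $p_i = p(X_i)$ with $p_1 \ge p_2 \ge \dots \ge p_5$ and assuming Huffman's procedure (repeatedly merge the two least probable nodes) as the yardstick for optimality. The code $0, 10, 110, 1110, 1111$ comes out of that procedure whenever each merged node is still among the two least probable at the next step, which for five symbols means

$$p_4 + p_5 \le p_2, \qquad p_3 + p_4 + p_5 \le p_1 .$$

If every ratio satisfies $p_i/p_{i+1} \ge 2$, then

$$p_4 + p_5 \le \left(\tfrac14 + \tfrac18\right) p_2 = \tfrac38\, p_2, \qquad p_3 + p_4 + p_5 \le \left(\tfrac14 + \tfrac18 + \tfrac1{16}\right) p_1 = \tfrac7{16}\, p_1 ,$$

so both inequalities hold. On this reading a ratio of $2$ is sufficient but not necessary: for instance, the probabilities $(0.4, 0.3, 0.15, 0.1, 0.05)$ satisfy both inequalities even though $p_1/p_2 < 2$.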