There are many algorithms (such as Huffman and arithmetic coding) that exploit the redundancy in the source message stream and compress the source symbols before sending them over a (noisy or noiseless) channel to the receiver.
Information theory resources assume that the source symbols are known (given). They then explain various algorithms that take advantage of the uneven symbol probabilities in the source message to construct the final code.
There are two steps to encoding a source message:

- Identify the source symbols (usually a set of ASCII characters)
- Construct a code using the source symbol probabilities (Huffman, arithmetic coding, etc.)
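To make the two steps concrete, here is a toy sketch (my own illustration, not from any particular textbook) where step 1 is simply "each distinct character is a symbol" and step 2 builds a Huffman code from the observed frequencies:

```python
import heapq
from collections import Counter

def huffman_code(message):
    """Build a Huffman code from observed symbol frequencies."""
    freq = Counter(message)  # step 1: symbols = distinct characters seen
    # Heap entries: (total weight, tiebreak id, {symbol: code-so-far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Repeatedly merge the two lightest subtrees, prefixing their
        # symbols' codes with 0 and 1 respectively.
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

code = huffman_code("abracadabra")
encoded = "".join(code[s] for s in "abracadabra")
```

Here the most frequent symbol (`a`, 5 occurrences) gets a 1-bit code, so the whole message takes 23 bits instead of 88 bits of fixed 8-bit ASCII. My question is about whether step 1 itself can be done more cleverly than this character-level choice.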
My question is regarding step 1.
How can we identify the set of source symbols? For instance, one can simply use the ASCII set as the source symbols, but can we do better? Is there an algorithm that identifies the source symbols (rather than simply taking ASCII characters) and then encodes the message so that the overall code has minimum length? Just as we have variable-length codes, can we have variable-length symbols?
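To illustrate the kind of thing I mean by "variable-length symbols", here is a toy sketch (my own, purely for illustration) that greedily fuses the most frequent adjacent pair of symbols into one longer symbol, so that frequent substrings like `the` become single symbols before any Huffman/arithmetic coding is applied:

```python
from collections import Counter

def merge_most_frequent_pair(symbols):
    """One round of 'symbol discovery': fuse the most frequent adjacent
    pair into a single new, longer symbol."""
    pairs = Counter(zip(symbols, symbols[1:]))
    if not pairs:
        return symbols
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(symbols):
        # Replace every non-overlapping occurrence of the pair (a, b).
        if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
            merged.append(a + b)
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

syms = list("the theme thesis")  # start from single characters
for _ in range(3):
    syms = merge_most_frequent_pair(syms)
```

After a few rounds, `the` is a single symbol and the symbol stream is shorter. Is there a principled algorithm along these lines (or a completely different one) that chooses such symbols so the final code length is minimized, rather than my ad hoc greedy merging?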