Shannon's Desiderata


I'm currently reading James Stone's Information Theory: A Tutorial Introduction in which he says that,

Shannon knew that in order for a mathematical definition of information to be useful it has to have a particular minimal set of properties.

He then goes on to say these are:

  • Continuity: the amount of information associated with an outcome increases or decreases continuously (i.e., smoothly) as the probability of that outcome changes
  • Symmetry: the amount of information associated with a sequence of events does not depend on the order in which the outcomes occur
  • Maximal value: the amount of information associated with a set of outcomes cannot be increased if those outcomes are already equally probable
  • Additive: the information associated with a set of outcomes is obtained by adding the information of the individual outcomes

What isn't clear to me, however, is why these criteria must be fulfilled for the mathematical definition of information to be useful. How did Shannon come up with these? Why specifically is each required? I'm sure there is important rationale for all of them, but to a layman like myself it seems a smidge arbitrary.


While I cannot be certain, I believe Shannon came upon these ideas by studying a toy problem. To explain why I think that, we'll work through some simple examples ourselves to justify each criterion. We'll treat natural language as information and consider communicating with each other in writing, i.e., with strings, just as we're actually doing.

For continuity, consider "I love math" as our string. Now suppose I swap a character so that it becomes "I lxve math", which carries less information because it isn't a valid sentence. Still, it's close to two valid strings, "I love math" and "I live math", so it's nearly perfect. That means I should expect the information content of "I lxve math" to be very close to that of the valid strings. In fact, most English speakers would know that it probably means "I love math", so its information content should be very close to that of "I love math". This justifies continuity.
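In Shannon's eventual formalism this becomes the statement that the information of an outcome, h(p) = -log2(p), is a continuous function of its probability. A minimal numeric sketch (the function name here is my own, not from the book):

```python
import math

def surprisal(p: float) -> float:
    """Shannon information content of an outcome with probability p, in bits."""
    return -math.log2(p)

# A tiny nudge in the probability produces only a tiny change in information,
# just as "I lxve math" sits close to "I love math" in information content.
h1 = surprisal(0.50)    # exactly 1 bit
h2 = surprisal(0.5001)  # slightly perturbed probability
print(abs(h1 - h2))     # a very small number
```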

For symmetry we'll send two strings: the first will be "I love math" and the second will be "Shannon built it". Here, regardless of which sentence I transmit first, we can perfectly understand what was written. The information content didn't change just because one message came before the other. That's what's meant by symmetry, and so it's justified.
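Formally, the entropy of a set of outcomes depends only on their probabilities, not on the order in which they're listed. A quick sketch under that reading:

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Reordering the outcomes (e.g. sending "Shannon built it" before
# "I love math") permutes the probabilities but leaves entropy unchanged.
print(entropy([0.5, 0.3, 0.2]))
print(entropy([0.2, 0.5, 0.3]))
```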

Maximal value is a little more complex, but we can reuse the examples from continuity and symmetry. In the symmetry example we expect both valid sentences to be possible, as we have no reason to prefer one message over the other. That makes them equiprobable as messages. Since both carry perfect information, there is no way to increase the information content. With "I lxve math", however, the probability would be lower than that of a valid sentence. How can we improve the information content of the string? We fix it, replacing "x" with "o" or "i", which increases the information content. Once fixed, we're back to perfect information and the strings are equiprobable, so no further increase is possible. This motivates the maximal value property.
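In entropy terms this says a uniform distribution over the outcomes already carries the maximum possible information, so nothing can push it higher. For example:

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = entropy([0.5, 0.5])  # two equiprobable outcomes: 1 bit
skewed = entropy([0.9, 0.1])   # a biased pair carries less information
print(uniform, skewed)         # uniform is strictly larger
```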

Finally we have additivity. Let's consider the two examples from symmetry again, but make them grammatically correct by concatenating the string ". " to the end of each. We have two distinct messages with perfect information content. However, I could instead transmit a single longer string containing both sentences. That new message will have exactly the same information content as the two separate messages. Alternatively, imagine you asked four distinct questions, one for each property, rather than one question covering all four. If I answered all four separately, you'd still have the same information. This justifies additivity.
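This is exactly why a logarithm shows up in Shannon's definition: for independent messages the joint probability multiplies, and -log2 turns that product into a sum. A small sketch:

```python
import math

def surprisal(p: float) -> float:
    """Shannon information content of an outcome with probability p, in bits."""
    return -math.log2(p)

p, q = 0.25, 0.5                        # two independent messages
joint = surprisal(p * q)                # sending both in one transmission
separate = surprisal(p) + surprisal(q)  # sending them one at a time
print(joint, separate)                  # prints 3.0 3.0
```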

So all four of these properties hold for this familiar example. That's a good reason to believe they might hold in general, but we should still take a moment to see whether the argument really can be generalized. For example, in practice we'll use binary strings rather than natural language to communicate in an electronic system. If we can show that these two spaces are similar, we can justify the abstraction more easily. To do that we can simply consider an encoding: we choose certain binary strings to match certain letters, perhaps with some additional information such as headers and footers to help us understand the context and extent of the message. Since this encoding is also just a string, all of the previous arguments work just as well. This justifies the generalization.
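As a toy version of such an encoding (the fixed-width scheme below is just an illustration, not a real code with headers and footers):

```python
def encode(text: str) -> str:
    """Map each character to a fixed-width 8-bit binary string."""
    return "".join(format(ord(c), "08b") for c in text)

def decode(bits: str) -> str:
    """Invert encode by reading the bitstring back 8 bits at a time."""
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))

msg = "I love math"
bits = encode(msg)
print(bits[:16])            # the first two encoded characters
print(decode(bits) == msg)  # round-trips: True
```

Since the encoded bitstring is itself just a string, every earlier argument about swapped characters, reordering, and concatenation applies to it unchanged.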

As a final note, in the spirit of toy problems, let me give you some things to think on. Consider what these rules mean for the double encoding, which just sends the same string twice. Two recurring themes in coding theory are detecting and correcting errors. How do these rules work in that context? Can we use this redundancy to detect or correct our errors? What would happen if you abstracted the double encoding to send the same message thrice?

I hope this helps and that you find as much joy in information theory as I have.