Can you help me understand the additivity property of information entropy? I think I may have found a violation, but it may be that I'm just misunderstanding the maths. I'm trying to build a solid understanding of information entropy and compression by studying the first 9 chapters of MacKay's book, *Information Theory, Inference, and Learning Algorithms*. (You can get the free book here. I highly recommend it.)
In Chapter 4, MacKay introduces the additivity property of Shannon information for two independent random variables $x$ and $y$:
$h(x, y) = \log_2\frac{1}{P(x,y)} = \log_2\frac{1}{P(x)P(y)} = \log_2\frac{1}{P(x)} + \log_2\frac{1}{P(y)}$
In general we can write,
$H(X, Y) = H(X) + H(Y)$
This is pretty straightforward since the joint probability of two events is:
$P(x,y) = P(x|y)P(y)$
and two events are independent if:
$P(x|y) = \frac{P(x,y)}{P(y)} = P(x)$
So we write the joint probability as:
$P(x,y) = P(x)P(y)$
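For concreteness, the additivity property can be checked numerically. Here is a minimal sketch (the two distributions are made-up values for illustration, not from MacKay):

```python
import math

def entropy(p):
    """Shannon entropy in bits of a probability distribution."""
    return sum(-q * math.log2(q) for q in p if q > 0)

# Two hypothetical independent distributions
px = [0.5, 0.25, 0.25]
py = [0.7, 0.3]

# Joint distribution of independent X, Y: P(x, y) = P(x) P(y)
pxy = [a * b for a in px for b in py]

print(entropy(px) + entropy(py))  # H(X) + H(Y)
print(entropy(pxy))               # H(X, Y): equal, by additivity
```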
However, I found that this property did not hold when I worked through one of the numerical examples in the text. It also did not hold for an artificial language I randomly generated.
MacKay creates a dictionary for the artificial language Wenglish by selecting $2^{15}$ five-character words according to the probability distribution over a–z plus the space character '-'.
MacKay then claims that, since Wenglish uses all of its words with equal probability, the average information content is 15 bits per word, i.e. 15 bits / 5 chars = 3 bits/char.
The information content is highly variable at the first character (low for an initial a, high for an initial z): "however a word is exactly 15 bits so the letters that follow an initial z have lower average information content per character than an initial a."
So I computed the character-by-character entropy of the example words he gives, such as:
zatnt
zxast
odrcr
aztdn
I found that the individual character entropies did NOT sum to 15 bits. Does this constitute a violation of the additivity property of entropy, or is it that on average it should be 15, and we can only assume additivity when the character distributions are uniform?
As requested, here are my calculations:
For word 'zatnt' with following probabilities:
[0.0007 0.0575 0.0706 0.0596 0.0706]
and information content $\log_2 \frac{1}{p}$:
[10.4804 4.1203 3.8242 4.0685 3.8242]
with a sum:
$26.3 \neq 15$
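For reference, this calculation can be reproduced with a short script (using the per-letter probabilities from the table below; each term is the surprisal of one character under an independent-characters model):

```python
import math

# Letter probabilities from MacKay's table (only the letters in 'zatnt')
p = {'z': 0.0007, 'a': 0.0575, 't': 0.0706, 'n': 0.0596}

word = 'zatnt'
info = [math.log2(1 / p[c]) for c in word]  # per-character surprisal in bits

print([round(h, 4) for h in info])  # [10.4804, 4.1203, 3.8242, 4.0685, 3.8242]
print(round(sum(info), 1))          # 26.3, not 15
```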
Here are the probabilities:
a = .0575
b = .0128
c = .0263
d = .0285
e = .0913
f = .0173
g = .0133
h = .0313
i = .0599
j = .0006
k = .0084
l = .0335
m = .0235
n = .0596
o = .0689
p = .0192
q = .0008
r = .0508
s = .0567
t = .0706
u = .0334
v = .0069
w = .0119
x = .0073
y = .0164
z = .0007
'-' = .1928
There seem to be several problems here.
First, the entropy is, by definition, an average: $H(X) = E[-\log p(X)]$. You don't compute the entropy of a single realization; for a single outcome $x$ you can compute its "information content," $-\log p(x)$.
Second, the "Wenglish" language is defined in somewhat tricky terms, which can lead to confusion: it consists of a dictionary of $2^{15}$ words constructed "at random" according to the probability table you copied; the source is then a device that chooses a word uniformly at random from that dictionary. The confusion arises because the word "probability" is used twice here. For the entropy calculation, only the second use is relevant; the dictionary should be taken as given.
Now, given a fixed dictionary (and knowing that words are drawn uniformly), to compute the probability that the first letter is 'a' we should not use the probability table, but simply count the proportion of words starting with 'a' among the $2^{15}$ words in the dictionary. True, this should be near $p_a$ in your table. But the successive letters are not independent: to take the extreme case, given the first four letters, the probability that the fifth is 'a' will not be $p_a$; it will probably be 0 or 1 (given the first four letters we can almost always deduce the fifth). Hence, the conditional entropies of successive letters are lower than the marginals. This explains why the total entropy is $15$ bits (as it must be for a source that emits $2^{15}$ equiprobable macro-symbols), which is much less than $5 \times H(x_1)$.
In summary, when the dictionary is fixed the symbols are no longer independent, and hence the additivity property does not apply.
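This can be illustrated with a small simulation. The sketch below uses a smaller dictionary of $2^{10}$ words (for speed) and the letter probabilities from the question: once the dictionary is fixed and words are drawn uniformly from it, the true source entropy is exactly $\log_2 N$ bits per word, while the sum of the per-position marginal entropies is much larger:

```python
import math
import random
from collections import Counter

random.seed(0)
letters = 'abcdefghijklmnopqrstuvwxyz-'
# Letter probabilities from the table in the question
probs = [0.0575, 0.0128, 0.0263, 0.0285, 0.0913, 0.0173, 0.0133, 0.0313,
         0.0599, 0.0006, 0.0084, 0.0335, 0.0235, 0.0596, 0.0689, 0.0192,
         0.0008, 0.0508, 0.0567, 0.0706, 0.0334, 0.0069, 0.0119, 0.0073,
         0.0164, 0.0007, 0.1928]

# Build a Wenglish-style dictionary: 2**10 distinct 5-letter random words
words = set()
while len(words) < 2 ** 10:
    words.add(''.join(random.choices(letters, weights=probs, k=5)))
words = sorted(words)

def entropy(counts):
    """Shannon entropy in bits of an empirical distribution of counts."""
    total = sum(counts.values())
    return sum(-c / total * math.log2(c / total) for c in counts.values())

# Entropy of the source: words drawn uniformly from the dictionary
h_word = math.log2(len(words))  # exactly 10 bits

# Sum of the marginal (per-position) letter entropies
h_marginals = sum(entropy(Counter(w[i] for w in words)) for i in range(5))

print(h_word)        # 10.0
print(h_marginals)   # well above 10: the letters are not independent
```

The gap between the two numbers is exactly the additivity "violation" in the question: summing marginal per-character quantities overcounts, because within a fixed dictionary later letters are partly determined by earlier ones.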