In his paper *Prediction and Entropy of Printed English*, Shannon defines the entropy $H$ of a language as
$$H = \lim_{N \to \infty} F_N$$
where $$F_N = -\sum_{i, j} p(b_i, j) \log p(j \mid b_i)$$
where $b_i$ is a block of $N-1$ letters and $j$ is the letter that follows it.
The interpretation is: $F_N$ is the average surprisal of the continuation $j$ of a block $b_i$ (of length $N-1$), weighted by the probability of the whole sequence $(b_i, j)$ occurring in the first place.
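To make this concrete, here is a minimal sketch (my own addition, not from the paper) that estimates $F_N$ from empirical $N$-gram counts of a string. The plug-in (maximum likelihood) probabilities and the toy corpus are assumptions for illustration only; real entropy estimates need far more text and smoothing.

```python
from collections import Counter
from math import log2

def estimate_F_N(text, N):
    """Plug-in estimate of F_N from empirical N-gram counts of `text`.

    b_i ranges over blocks of N-1 characters, j over the character
    that follows each block.
    """
    # Counts of N-grams (b_i, j); total is the number of windows.
    ngrams = Counter(text[k:k + N] for k in range(len(text) - N + 1))
    total = sum(ngrams.values())

    # Marginal counts of the (N-1)-gram prefixes b_i.
    prefixes = Counter()
    for gram, count in ngrams.items():
        prefixes[gram[:-1]] += count

    # F_N = -sum_{i,j} p(b_i, j) log2 p(j | b_i)
    f_n = 0.0
    for gram, count in ngrams.items():
        p_joint = count / total
        p_cond = count / prefixes[gram[:-1]]
        f_n -= p_joint * log2(p_cond)
    return f_n

text = "the quick brown fox jumps over the lazy dog " * 50
for N in range(1, 5):
    print(N, estimate_F_N(text, N))  # F_N typically shrinks as N grows
```

As expected, longer contexts make the next letter less surprising, so the estimates decrease with $N$, mirroring the idea that $H = \lim_{N \to \infty} F_N$.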
Shannon further states that
$$F_N = -\sum_{i, j} p(b_i, j) \log p(j \mid b_i) = -\sum_{i, j} p(b_i, j) \log p(b_i, j) + \sum_i p(b_i) \log p(b_i)$$
We arrive at that equivalence as follows:
\begin{align} F_N & = - \sum_{i, j} p(b_i, j) \log p(j \mid b_i) \\ & = - \sum_{i, j} p(b_i, j) \log \frac{p(b_i, j)}{p(b_i)} \\ & = - \sum_{i, j} p(b_i, j) \left( \log p(b_i, j) - \log p(b_i) \right) \\ & = - \sum_{i, j} p(b_i, j) \log p(b_i, j) + \sum_{i, j} p(b_i, j) \log p(b_i) \\ & = - \sum_{i, j} p(b_i, j) \log p(b_i, j) + \sum_{i} p(b_i) \log p(b_i) \end{align}
where the last step uses the marginal distribution $\sum_j p(b_i, j) = p(b_i)$ in the second sum.
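As a quick sanity check (again my own addition), the identity can be verified numerically on a small hand-made joint distribution:

```python
from math import log2, isclose

# Toy joint distribution p(b_i, j) over two blocks and two next letters.
p_joint = {
    ("b0", "a"): 0.30, ("b0", "b"): 0.10,
    ("b1", "a"): 0.15, ("b1", "b"): 0.45,
}

# Marginal p(b_i) = sum_j p(b_i, j)
p_block = {}
for (b, _), p in p_joint.items():
    p_block[b] = p_block.get(b, 0.0) + p

# Left-hand side: F_N = -sum p(b_i, j) log2 p(j | b_i)
lhs = -sum(p * log2(p / p_block[b]) for (b, _), p in p_joint.items())

# Right-hand side: -sum p(b_i, j) log2 p(b_i, j) + sum p(b_i) log2 p(b_i)
rhs = (-sum(p * log2(p) for p in p_joint.values())
       + sum(p * log2(p) for p in p_block.values()))

assert isclose(lhs, rhs)
print(lhs, rhs)  # both sides agree
```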
My question was about the interpretation of the formula, and I answered it myself while writing the question. But since references to this paper are sparse, and I only found a very badly scanned version that is hard to read, maybe this write-up helps somebody.