I recently started reading some information theory texts and was immediately struck by the strangeness of the syntactic choices for some basic concepts. For example:
- KL-divergence is written as $D(P||Q)$ (roughly; I can't even figure out how to write the double bars appropriately in TeX) instead of $D(P, Q)$
- Mutual information is often written as $I(X; Y)$ instead of $I(X, Y)$
There may be more examples that I haven't encountered yet. I'm curious if anyone has (historical) insight into why this came to be.
Regarding mutual information, there are cases where one is interested in the mutual information between sets of random variables, say between $\{X_1, X_2\}$ and $\{Y_1, Y_2, Y_3\}$. In that case, the notation $I(X_1, X_2; Y_1, Y_2, Y_3)$ is arguably more convenient than $I(\{X_1, X_2\}, \{Y_1, Y_2, Y_3\})$: the semicolon separates the two groups, while commas separate the variables within each group.
P.S.: you can write `\|` in TeX for double bars.
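For reference, here is a minimal LaTeX sketch of the notations discussed above; `\|` and `\parallel` are both standard commands for the double bar:

```latex
% KL divergence: \| or \parallel both typeset double bars
$D(P \| Q)$ \quad $D(P \parallel Q)$

% Mutual information between groups of random variables:
% the semicolon separates the two groups, commas the variables within each
$I(X_1, X_2; Y_1, Y_2, Y_3)$
```

In inline math, `\|` is the shorter form; `\parallel` adds relation-style spacing around the bars, which some authors prefer in displayed formulas.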