What does the semicolon mean in $I(X;Y)$ (mutual information)?


$I(X;Y)$ is the symbol for mutual information, where $X$ and $Y$ are random variables. It measures the co-dependence between $X$ and $Y$ including non-linear interaction, whereas linear correlation is written as $\rho(X,Y)$.
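The contrast between mutual information and linear correlation can be seen on a tiny discrete example. The sketch below (a minimal empirical estimate, not a library implementation) takes $Y = X^2$ on a symmetric support, where $\rho(X,Y) = 0$ but $I(X;Y) > 0$ because $Y$ is fully determined by $X$:

```python
import math
from collections import Counter

def mutual_information(pairs):
    """I(X;Y) in bits, estimated from the empirical joint distribution of (x, y) pairs."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def correlation(pairs):
    """Pearson correlation of the paired samples."""
    n = len(pairs)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

# Y = X^2 on a symmetric support: linear correlation vanishes, but MI does not.
pairs = [(x, x * x) for x in (-2, -1, 0, 1, 2)]
print(correlation(pairs))          # 0.0 up to rounding
print(mutual_information(pairs))   # H(Y): about 1.52 bits, since Y is determined by X
```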

Why does mutual information have a semicolon between its inputs, whereas the arguments for correlation only have a comma? How incorrect is it to write $I(X,Y)$ instead?



Best answer

You asked two questions; let me answer them both.

Why do we use semicolons for mutual information?

While with two variables it would be fine to write a comma (people would understand what you mean), you would run into trouble as soon as you deal with vectors, or with mutual information between sets of random variables. Most notably, in many of the inequalities used in information theory (say, Fano's inequality, or channel-capacity calculations), you will encounter mutual information terms like

$I(X; Y,Z)$ and $I(X_1, X_2; Y_1, Y_2)$

and you "can" of course write them as

$I(X, (Y,Z))$, and $I((X_1, X_2), (Y_1, Y_2))$

but it is that much uglier.
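The grouping the semicolon expresses is not cosmetic: $I(X;\, Y,Z)$ and $I(X,Y;\, Z)$ are generally different quantities. A minimal sketch (reusing a small empirical MI estimator; the example distribution is my own, chosen so the two groupings disagree) makes this concrete by packing each side of the semicolon into a tuple:

```python
import math
from collections import Counter

def mi(pairs):
    """I(A;B) in bits from the empirical joint of (a, b) pairs; a and b may be tuples."""
    n = len(pairs)
    pab = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pb = Counter(b for _, b in pairs)
    return sum((c / n) * math.log2(c * n / (pa[a] * pb[b]))
               for (a, b), c in pab.items())

# Equally likely outcomes of (X, Y, Z) where Y = X and Z is an independent fair coin.
triples = [(x, x, z) for x in (0, 1) for z in (0, 1)]

i_x_yz = mi([(x, (y, z)) for x, y, z in triples])  # I(X; Y, Z)
i_xy_z = mi([((x, y), z) for x, y, z in triples])  # I(X, Y; Z)
print(i_x_yz)  # 1.0 bit: (Y, Z) pins down X through Y
print(i_xy_z)  # 0.0 bits: Z is independent of (X, Y)
```

With commas everywhere, $I(X, Y, Z)$ could not tell these two readings apart; the semicolon marks where the split goes.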

Personally, I like the semicolon since it emphasizes the two halves of mutual information. It is not a function that takes more than two inputs (usually; there is a multi-way generalization, but it is much less used), and in English, when we list more than two objects, we tend to use commas rather than semicolons as separators (unless we are listing sentences, but then we should probably split them up anyway). The main issue I have with the semicolon is that in statistics it is sometimes used to denote parameters (more so in the frequentist setting; in the Bayesian setting, you usually use $\mid$ for conditioning on an observation), but overall, I like this notation.

How incorrect is it?

It is incorrect. Don't do it. If you really want to use this notation, define $I(X, Y) := I(X; Y)$ at the beginning of your document, but please don't.

Another answer

$I(X;Y) = I(Y;X)$, so technically you should be fine writing $I(X,Y)$. However, based on what I gather from Wikipedia (https://en.wikipedia.org/wiki/Mutual_information), $I(X;Y)$ may be used to emphasize the interpretation in terms of KL divergence, which highlights the asymmetry introduced by conditioning on one of the two variables (see the second expression in the "Relation to Kullback–Leibler divergence" section).
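Both points here can be checked numerically. The sketch below (a small empirical estimate on a made-up correlated pair of bits) computes $I(X;Y)$ directly from the joint, and again via the conditional-KL form $\mathbb{E}_Y\!\left[ D_{\mathrm{KL}}\!\left( P_{X \mid Y} \,\|\, P_X \right) \right]$; the two agree, and swapping the arguments leaves the value unchanged:

```python
import math
from collections import Counter

def mi(pairs):
    """I(X;Y) in bits from the empirical joint of (x, y) pairs."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def mi_via_kl(pairs):
    """E_Y[ D_KL( P(X|Y=y) || P(X) ) ]: the asymmetric-looking form of I(X;Y)."""
    n = len(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    pxy = Counter(pairs)
    total = 0.0
    for y, cy in py.items():
        # KL divergence of the conditional P(X | Y=y) against the marginal P(X)
        kl = sum((pxy[(x, y)] / cy) * math.log2((pxy[(x, y)] / cy) / (px[x] / n))
                 for x in px if pxy[(x, y)] > 0)
        total += (cy / n) * kl
    return total

# A correlated pair: two fair bits that agree 3 times out of 4.
pairs = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (0, 0), (1, 1)]
print(mi(pairs))                       # direct definition
print(mi_via_kl(pairs))                # same value via the conditional-KL decomposition
print(mi([(y, x) for x, y in pairs]))  # I(Y;X) equals I(X;Y): symmetric
```

So the asymmetry lives in the *form* of the KL expression (which variable you condition on), not in the value of $I(X;Y)$ itself.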