I read Meilă's original paper, where the variation of information metric is defined as $$\operatorname{VI}(X,Y) = H(X \mid Y) + H(Y \mid X),$$ where $X$, $Y$ are two clusterings of a dataset $D$. The proof that VI satisfies the metric axioms relies on properties of the clusterings themselves.
However, the Wikipedia page on Mutual Information lists $\operatorname{VI}(X,Y)$ as a metric (seemingly) for two arbitrary discrete random variables $X$, $Y$. Is VI a true metric for arbitrary discrete random variables, or only for clusterings?
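For concreteness, here is a minimal numerical sketch of the clustering case (function names are my own, not from the paper). It computes VI from two label arrays over the same data points via the equivalent identity $\operatorname{VI}(X,Y) = 2H(X,Y) - H(X) - H(Y)$, which follows from $H(X \mid Y) = H(X,Y) - H(Y)$:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a probability vector, skipping zero cells."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def variation_of_information(labels_x, labels_y):
    """VI(X, Y) = H(X | Y) + H(Y | X) for two clusterings of the same points.

    Computed as 2*H(X, Y) - H(X) - H(Y) from the normalized contingency table.
    """
    xs, x_idx = np.unique(labels_x, return_inverse=True)
    ys, y_idx = np.unique(labels_y, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (x_idx, y_idx), 1)   # count co-occurrences of labels
    joint /= joint.sum()                   # joint distribution p(x, y)
    return (2 * entropy(joint.ravel())
            - entropy(joint.sum(axis=1))   # H(X), marginal over rows
            - entropy(joint.sum(axis=0)))  # H(Y), marginal over columns
```

Identical clusterings (even under relabeling) give $\operatorname{VI} = 0$, e.g. `variation_of_information([0, 0, 1, 1], [1, 1, 0, 0])`, while independent splits give a strictly positive value.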