Does the variation of information metric need to be defined on clusterings?


I read Meila's original paper, where the variation of information metric is defined as $$\operatorname{VI}(X,Y) = H(X \mid Y) + H(Y \mid X),$$ where $X$ and $Y$ are two clusterings of a dataset $D$. The proof that VI is a metric relies on properties of the clusterings themselves.
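For concreteness, here is a minimal sketch (my own, not taken from the paper) of computing $\operatorname{VI}$ for two clusterings given as label sequences over the same items, using the identity $H(X \mid Y) = -\sum_{a,b} p(a,b)\log\frac{p(a,b)}{p(b)}$:

```python
from collections import Counter
from math import log

def variation_of_information(x, y):
    """VI(X, Y) = H(X|Y) + H(Y|X), in nats, for two clusterings
    given as equal-length label sequences over the same items."""
    n = len(x)
    px = Counter(x)           # cluster sizes in X
    py = Counter(y)           # cluster sizes in Y
    pxy = Counter(zip(x, y))  # joint cell counts n_ab
    vi = 0.0
    for (a, b), nab in pxy.items():
        p = nab / n
        # p(a|b) = nab / |b|  and  p(b|a) = nab / |a|
        vi -= p * (log(nab / py[b]) + log(nab / px[a]))
    return vi

# Identical clusterings (up to relabeling) are distance 0:
print(variation_of_information([0, 0, 1, 1], [1, 1, 0, 0]))  # 0.0
```

Per the definition, the value is $0$ exactly when the two clusterings induce the same partition, regardless of labels.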

However, the Wikipedia page on mutual information lists $\operatorname{VI}(X,Y)$ as a metric on (seemingly) arbitrary discrete random variables $X$ and $Y$. Is VI a true metric for arbitrary discrete random variables, or only for clusterings?