Given two random variables $X,Y$, their common information $X\wedge Y$ is defined in (Wolf and Wullschleger 2004) as the random variable $X\wedge Y=f_X(X)=f_Y(Y)$ constructed as follows:
- Let $G$ be the graph whose vertex set is the union $\mathcal X\cup\mathcal Y$ of the possible outcomes of $X$ and of $Y$. Two vertices $x\in\mathcal X$ and $y\in\mathcal Y$ of $G$ are connected by an edge if and only if $P_{XY}(x,y)>0$.
- Let $f_X:\mathcal X\to 2^{\mathcal X\cup\mathcal Y}$ be the function sending each vertex $x\in\mathcal X$ to the connected component of $G$ containing $x$. In words, $f_X$ sends each outcome to the full set of vertices of $G$ reachable from $x$ (not necessarily by a single edge). The function $f_Y:\mathcal Y\to 2^{\mathcal X\cup\mathcal Y}$ is defined analogously.
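For concreteness, here is how I understand the construction in code, a sketch of my own (the function names and the tagging of vertices are mine, not the paper's); the components of the bipartite graph are found with a union-find:

```python
def common_information_map(pxy):
    """Map each outcome of X or Y to (a canonical label of) its
    connected component in the bipartite graph G described above.
    pxy: dict mapping (x, y) -> P(x, y)."""
    xs = {x for x, _ in pxy}
    ys = {y for _, y in pxy}
    # Tag vertices so X-outcomes and Y-outcomes stay distinct.
    parent = {('x', x): ('x', x) for x in xs}
    parent.update({('y', y): ('y', y) for y in ys})

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    def union(u, v):
        parent[find(u)] = find(v)

    # Connect x and y whenever P(x, y) > 0.
    for (x, y), p in pxy.items():
        if p > 0:
            union(('x', x), ('y', y))

    # f_X and f_Y, packed into one dict: vertex -> component label.
    return {v: find(v) for v in parent}

# A joint distribution whose support has two blocks:
# {a, b} x {0} and {c} x {1}.
pxy = {('a', 0): 0.25, ('b', 0): 0.25, ('c', 1): 0.5}
f = common_information_map(pxy)
assert f[('x', 'a')] == f[('x', 'b')] == f[('y', 0)]
assert f[('x', 'c')] == f[('y', 1)] != f[('x', 'a')]
```

So $X\wedge Y$ takes one value per block of the support, and both parties can compute it from their own observation alone.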
This random variable is used to discuss zero-error transmission rates, among other applications.
The authors go on to show that we have the identity $$I(X:Y) = H(X\wedge Y) + I(X:Y|X\wedge Y),$$ and to prove this they use the simpler identity $H(X)=H(XC)$, with $C\equiv X\wedge Y$. I'm trying to get a better understanding of why $H(X)=H(XC)$ holds. It seems like a simple enough statement, but I'm not seeing how it's proved, probably because I haven't really fully understood the intuition behind the above definition of common information.
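Just to convince myself, the identity does check out numerically on a small example (everything below is my own sanity check, not from the paper; the support splits into two blocks, so $C$ is simply the block label):

```python
import math
from collections import defaultdict

def H(dist):
    """Shannon entropy (bits) of a dict outcome -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marg(pairs, i):
    """Marginal of a joint dict (x, y) -> p onto coordinate i."""
    m = defaultdict(float)
    for k, p in pairs.items():
        m[k[i]] += p
    return m

# Two-block joint distribution: support {0,1}x{0,1} union {2}x{2},
# so C = X ∧ Y is the block label, computable from X or from Y alone.
pxy = {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.1, (2, 2): 0.5}
block = lambda v: 'A' if v in (0, 1) else 'B'   # here f_X = f_Y

I_XY = H(marg(pxy, 0)) + H(marg(pxy, 1)) - H(pxy)

pc = defaultdict(float)                          # distribution of C
for (x, _), p in pxy.items():
    pc[block(x)] += p
H_C = H(pc)

# I(X:Y|C) = sum_c P(C = c) * I(X:Y | C = c)
I_XY_given_C = 0.0
for c, q in pc.items():
    cond = {k: p / q for k, p in pxy.items() if block(k[0]) == c}
    I_XY_given_C += q * (H(marg(cond, 0)) + H(marg(cond, 1)) - H(cond))

print(abs(I_XY - (H_C + I_XY_given_C)) < 1e-9)  # True
```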
It's because $C$ is a function of $X$ (namely $C = f_X(X)$), and for any function $f$ we always have $H(X, f(X)) = H(X)$: $$H(X, f(X)) = \\ - \sum_x \sum_t P(X = x, f(X) = t) \log P(X = x, f(X) = t) =\\ -\sum_x P(X = x, f(X) = f(x)) \log P(X = x, f(X) = f(x)) = \\ -\sum_x P(X = x) \log P(X = x) = H(X).$$
The second equality holds because $P(X = x, f(X) = t) = 0$ whenever $t \neq f(x)$, and the third because $P(X = x, f(X) = f(x)) = P(X = x)$.
Intuitively, enlarging the state (we now record not only the value of $X$ but also that of $f(X)$) cannot decrease entropy, but we also cannot create entropy from nowhere by appending a description of data we already had: $f(X)$ is fully determined by $X$.
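A quick numerical illustration of $H(X, f(X)) = H(X)$ (my own toy example, not from the paper):

```python
import math

def H(dist):
    """Shannon entropy (bits) of a dict outcome -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# X uniform on {0, 1, 2, 3}; f is an arbitrary function of X.
px = {x: 0.25 for x in range(4)}
f = lambda x: x % 2

# Joint distribution of (X, f(X)): all mass sits on pairs (x, f(x)),
# so the joint is just a relabelling of the distribution of X.
pjoint = {(x, f(x)): p for x, p in px.items()}

print(H(px))      # 2.0
print(H(pjoint))  # 2.0 -- adding f(X) changes nothing
```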