I can't find a proof of this in Shannon's 1948 paper. Can you provide a proof or a disproof?
Thank you.
P.S.
$H(x)$ (or $H(y)$) is the entropy of messages produced by the discrete source $x$ (or $y$).
$H(x,y)$ is the joint entropy.
These are all standard quantities in information theory.
No, it doesn't have to. By the chain rule,
$H(X,Y) = H(X) + H(Y|X)$
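As a quick numerical sanity check of that identity, here is a minimal sketch (the joint pmf `pxy` below is a made-up example and `H` is a hypothetical helper, not anything from Shannon's paper):

```python
from math import log2

def H(ps):
    """Shannon entropy in bits of a probability vector (zero terms are skipped)."""
    return -sum(p * log2(p) for p in ps if p > 0)

# Made-up joint pmf p(x, y) over two binary sources, purely for illustration.
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal of X and the conditional entropy H(Y|X) = sum_x p(x) * H(Y | X=x).
px = {x: pxy[(x, 0)] + pxy[(x, 1)] for x in (0, 1)}
H_x = H(px.values())
H_y_given_x = sum(px[x] * H([pxy[(x, y)] / px[x] for y in (0, 1)]) for x in (0, 1))

print(H(pxy.values()))      # H(X,Y)          -> 1.846...
print(H_x + H_y_given_x)    # H(X) + H(Y|X)   -> the same number
```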
To lower $H(X,Y)$ while keeping $H(X)$ fixed, you need to lower $H(Y|X)$. You can lower $H(Y|X)$ without lowering $H(Y)$: we always have $0 \leq H(Y|X) \leq H(Y)$, and the gap between $H(Y|X)$ and $H(Y)$ measures how dependent $X$ and $Y$ are. The more dependent they are, the less entropy is left in $Y$ after you have learned $X$, so $H(Y|X)$ is lower.
From the same inequality you can also see that lowering $H(Y)$ pushes down the ceiling on $H(Y|X)$, and with it $H(X,Y)$, so lowering $H(Y)$ is one way to lower $H(X,Y)$, but not the only one.
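To make both points concrete, here is a small sketch with made-up toy distributions (the `report` helper is hypothetical): three joint pmfs over a pair of bits, one with $Y$ independent of $X$, one with $Y = X$ (same marginals, lower $H(Y|X)$), and one where $Y$ stays independent of $X$ but is biased, so only $H(Y)$ drops.

```python
from math import log2

def H(ps):
    """Shannon entropy in bits of a probability vector (zero terms are skipped)."""
    return -sum(p * log2(p) for p in ps if p > 0)

def report(name, pxy):
    px = [sum(p for (x, _), p in pxy.items() if x == b) for b in (0, 1)]
    py = [sum(p for (_, y), p in pxy.items() if y == b) for b in (0, 1)]
    Hx, Hy, Hxy = H(px), H(py), H(pxy.values())
    # H(Y|X) obtained from the chain rule checked above: H(Y|X) = H(X,Y) - H(X).
    print(f"{name:12s} H(X)={Hx:.2f} H(Y)={Hy:.2f} H(Y|X)={Hxy - Hx:.2f} H(X,Y)={Hxy:.2f}")

# 1) Fair X and Y, independent:           H(X,Y) = 2 bits.
report("independent", {(0, 0): .25, (0, 1): .25, (1, 0): .25, (1, 1): .25})
# 2) Same fair marginals but Y = X:       H(Y|X) = 0, so H(X,Y) = 1 bit,
#    even though H(X) and H(Y) both stay at 1 bit.
report("Y = X",       {(0, 0): .50, (0, 1): .00, (1, 0): .00, (1, 1): .50})
# 3) Fair X, biased Y, still independent: lowering H(Y) also lowers H(X,Y).
report("biased Y",    {(0, 0): .45, (0, 1): .05, (1, 0): .45, (1, 1): .05})
```

The second case drops $H(X,Y)$ by a full bit without touching $H(Y)$; the third drops it by lowering $H(Y)$ itself, with $H(X)$ fixed in both.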