I would like to know why mutual information is always non-negative. I learnt that mutual information I(X;Y) = H(Y) - H(Y|X). However, it seems to me that Y|X = x is just another distribution, a slice of the joint distribution at a particular x, and that slice could have more entropy than the marginal distribution of Y. Hence H(Y|X) could exceed H(Y), and I(X;Y) could be negative.
However, it is stated otherwise. What is an intuitive explanation? I have seen proofs that use KL divergence and Jensen's inequality, but they are not intuitive to me.
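To make my confusion concrete, here is a quick numerical check (the joint distribution `pxy` is just a toy example I made up, not from any textbook). It shows that a single slice H(Y|X=x) can indeed exceed H(Y), yet the weighted average H(Y|X) still comes out below H(Y), so I(X;Y) stays non-negative here:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log 0 is treated as 0
    return float(-(p * np.log2(p)).sum())

# Toy joint p(x, y): rows are x in {0, 1}, columns are y in {0, 1}.
# With prob 0.9, X=0 and Y=0 deterministically; with prob 0.1, X=1 and Y is uniform.
pxy = np.array([[0.90, 0.00],
                [0.05, 0.05]])

px = pxy.sum(axis=1)                  # marginal of X: [0.9, 0.1]
py = pxy.sum(axis=0)                  # marginal of Y: [0.95, 0.05]
H_Y = entropy(py)                     # roughly 0.286 bits

# Entropy of each conditional slice Y | X = x
H_Y_given_x = np.array([entropy(row / row.sum()) for row in pxy])
# H(Y|X=0) = 0, but H(Y|X=1) = 1 bit > H(Y): one slice CAN beat the marginal

H_Y_given_X = float((px * H_Y_given_x).sum())   # the average: 0.1 bits
I_XY = H_Y - H_Y_given_X                        # still non-negative
```

So the slice with X=1 has more entropy than Y itself, but it only carries weight 0.1 in the average, which is where my intuition breaks down.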