I am reading Shannon's famous paper and I am stuck on the "It is easily shown that" part. More precisely, I mean the inequality $$H(X, Y)\leq H(X) + H(Y),$$ where $X, Y$ are discrete random variables. Following the paper we have $$H(X,Y)=-\sum_{ij}p(i, j)\log(p(i,j)) \\ H(X)=-\sum_{ij}p(i, j)\log\left(\sum_{k}p(i,k)\right) \\ H(Y)=-\sum_{ij}p(i, j)\log\left(\sum_{l}p(l,j)\right)$$ Summing, we get $$H(X) + H(Y) = -\sum_{ij}p(i, j)\log(p(i)p(j)),$$ where $p(i)=\sum_{k}p(i,k)$ and $p(j)=\sum_{l}p(l,j)$ are the marginals. And this is where I am stuck. It is obvious from the above that if $X,Y$ are independent then equality holds. However, the inequality is a different story: in general $p(i,j)$ and $p(i)p(j)$ cannot be compared term by term, so some other method must be used.
What is the most straightforward proof of this inequality? (Or, if anyone knows, what proof did Shannon have in mind?) Shannon states the inequality before defining conditional entropy $H_X(Y)$, so I suspect it should in fact be easy... or did he lie?
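As a sanity check before looking for a proof, the inequality can be verified numerically on a random joint distribution (a hypothetical sketch, not part of the original question; the names `p`, `H`, `H_joint` are my own):

```python
import numpy as np

# Draw a random joint distribution p(i, j) on a 4 x 5 alphabet.
rng = np.random.default_rng(0)
p = rng.random((4, 5))
p /= p.sum()  # normalize so that sum_ij p(i, j) = 1

def H(dist):
    """Shannon entropy in bits, using the convention 0 * log 0 = 0."""
    dist = dist[dist > 0]  # drop zero entries to avoid log(0)
    return -(dist * np.log2(dist)).sum()

H_joint = H(p)
H_X = H(p.sum(axis=1))  # marginal p(i) = sum_k p(i, k)
H_Y = H(p.sum(axis=0))  # marginal p(j) = sum_l p(l, j)

print(H_joint, H_X + H_Y)  # H(X, Y) should not exceed H(X) + H(Y)
assert H_joint <= H_X + H_Y + 1e-12
```

Repeating this with other seeds and alphabet sizes, the inequality always holds, with equality only when the joint distribution factors as $p(i,j)=p(i)p(j)$.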
Rearrange the inequality so that one side is zero. Write the other side as a single sum over $i$ and $j$. Use the facts that the sum of two logs is the log of a product, and the difference of two logs is the log of a quotient, to combine the three logarithmic terms into a single one. Then use concavity of the logarithm.
EDIT: To explain the concavity part, consider that the graph of $\log_a x$ is concave down, for $a>1$. It follows that for all $x_1, x_2 > 0$ and all $s_1, s_2 \in [0,1]$ such that $s_1+s_2=1$, we have that $s_1 \log_a x_1 + s_2 \log_a x_2 \leq \log_a (s_1 x_1 + s_2 x_2)$. This inequality extends by induction from two pairs $(x_1,s_1), (x_2,s_2)$ to $n$ pairs. In our case, the $p(i,j)$ play the role of the $s_i$.
EDIT 2: The above yields a bound of $\log_a\left(\sum_{ij}p(i)p(j)\right)=\log_a 1$, i.e. of zero.
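Putting the hints together, the computation sketched above (in the notation of the question, with $p(i)=\sum_k p(i,k)$, $p(j)=\sum_l p(l,j)$, and the sum restricted to pairs with $p(i,j)>0$) reads:

```latex
\begin{align*}
H(X,Y) - H(X) - H(Y)
  &= \sum_{ij} p(i,j)\,\log_a\frac{p(i)\,p(j)}{p(i,j)} \\
  &\leq \log_a\left(\sum_{ij} p(i,j)\,\frac{p(i)\,p(j)}{p(i,j)}\right)
     && \text{(concavity of $\log_a$, weights $s_{ij}=p(i,j)$)} \\
  &= \log_a\left(\sum_{ij} p(i)\,p(j)\right)
   = \log_a\!\left(\sum_i p(i)\sum_j p(j)\right)
   = \log_a 1 = 0.
\end{align*}
```

This is exactly Jensen's inequality applied to the concave function $\log_a$; no conditional entropy is needed, which matches where the inequality appears in Shannon's paper.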