I seem to have a fundamental confusion regarding the channel coding theorem which I would like to resolve. In the theorem, we say that there exists an input distribution which maximises $I(X; Y)$ and that this quantity is the maximum rate at which symbols can be transmitted across the channel.
My question is: what does it mean for there to be a maximising input distribution? Since the maximum achievable rate is independent of the message being sent, and the encoded message may not have symbols drawn according to this maximising distribution, how is it that the theorem describes the maximum rate for any message you may want to send? Or does the optimal code somehow make the encoded message follow the maximising input distribution?
I am sorry if my question is unclear; I will try explaining further if needed.
The point is that the capacity is a property of the channel, and a channel is specified by $P(Y|X)$ (the probabilities of the output given the input).
The mutual information $I(X;Y)$ depends on the joint probability of $X$ and $Y$, which can be written $P(X,Y)=P(X)P(Y|X)$. This shows that $I(X;Y)$ depends not only on the channel, but also on the distribution of the input.
Fixing the channel, different input distributions $P(X)$ give different values of the mutual information (which is to say, of the transmitted information). To take an extreme case: if we have a perfect binary channel (no errors) but use an input with zero entropy (say, $X=0$ with probability $1$), then the mutual information is null: $I(X;Y)=H(X)-H(X|Y)=0-0=0$, which is reasonable.
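To make this concrete, here is a small sketch (the function name `mutual_information` is my own, not from any library) that computes $I(X;Y)$ directly from $P(X)$ and the channel matrix $P(Y|X)$, and checks the extreme case above:

```python
import numpy as np

def mutual_information(p_x, p_y_given_x):
    """I(X;Y) in bits, for input distribution p_x and channel matrix
    p_y_given_x[x][y] = P(Y=y | X=x)."""
    p_x = np.asarray(p_x, dtype=float)
    p_y_given_x = np.asarray(p_y_given_x, dtype=float)
    p_xy = p_x[:, None] * p_y_given_x            # joint: P(X,Y) = P(X) P(Y|X)
    p_y = p_xy.sum(axis=0)                       # output marginal P(Y)
    mask = p_xy > 0                              # skip zero-probability terms
    return float((p_xy[mask]
                  * np.log2(p_xy[mask] / (p_x[:, None] * p_y)[mask])).sum())

perfect = [[1.0, 0.0], [0.0, 1.0]]               # noiseless binary channel

print(mutual_information([0.5, 0.5], perfect))   # 1.0 bit: uniform input
print(mutual_information([1.0, 0.0], perfect))   # 0.0 bits: constant input
```

The same channel carries 1 bit per use with a uniform input and nothing at all with a constant input, which is exactly the point: $I(X;Y)$ is a function of both the channel and $P(X)$.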
So the capacity of the channel is defined as the maximum amount of information that we can transmit, keeping the channel characteristic (the transition probabilities) fixed but allowing any input distribution.
I don't quite understand that. To attain the capacity, don't you have to use a channel encoding that produces a channel input with $P(X)=P_{\max}(X)$?
It's true that in most theoretical developments (with most codes, like, say, the Hamming codes) we go after a symmetric input distribution, $P(X=0)=P(X=1)=1/2$ (the channel coding produces the same number of zeroes as ones), but that's because we are implicitly assuming that the channel is binary and symmetric (and that the unencoded inputs are equiprobable). And for that channel, that $P(X)$ is the one that attains the capacity.
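One way to see that the uniform input is the maximiser for a binary symmetric channel is to just sweep over input distributions numerically. A sketch, assuming a crossover probability of $\epsilon = 0.1$ (chosen purely for illustration); the helper `mutual_information` is the same hand-rolled function as above, not a library call:

```python
import numpy as np

def mutual_information(p_x, p_y_given_x):
    """I(X;Y) in bits, for input distribution p_x and channel matrix
    p_y_given_x[x][y] = P(Y=y | X=x)."""
    p_x = np.asarray(p_x, dtype=float)
    p_y_given_x = np.asarray(p_y_given_x, dtype=float)
    p_xy = p_x[:, None] * p_y_given_x            # joint: P(X,Y) = P(X) P(Y|X)
    p_y = p_xy.sum(axis=0)                       # output marginal P(Y)
    mask = p_xy > 0
    return float((p_xy[mask]
                  * np.log2(p_xy[mask] / (p_x[:, None] * p_y)[mask])).sum())

eps = 0.1                                        # BSC crossover probability
bsc = [[1 - eps, eps], [eps, 1 - eps]]

# Sweep P(X=0) over a grid and find the maximising input distribution.
grid = np.linspace(0.001, 0.999, 999)
rates = [mutual_information([p, 1 - p], bsc) for p in grid]
best = grid[int(np.argmax(rates))]

# Closed-form BSC capacity for comparison: C = 1 - H_2(eps).
h2 = -eps * np.log2(eps) - (1 - eps) * np.log2(1 - eps)

print(best)                    # ≈ 0.5: the uniform input attains the maximum
print(max(rates), 1 - h2)      # both ≈ 0.531 bits per channel use
```

The numerical maximum lands on $P(X=0)=1/2$ and agrees with the closed-form BSC capacity $C = 1 - H_2(\epsilon)$, which is what the symmetric codes above are implicitly exploiting.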