The noisy-channel coding theorem (Shannon's theorem) states that for a discrete memoryless channel, every code rate below the channel capacity is achievable. However, in the standard proof that I have seen in many references, the input message is assumed to be uniformly distributed without a satisfactory explanation. This assumption plays an important role in the proof: it symmetrizes the error probability, so that $P(\hat W=W)=P(\hat W=W\mid W=1)$ and we may assume without loss of generality that the message $W=1$ was sent.
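For concreteness, here is how I understand the symmetrization step (writing $M$ for the number of messages, and assuming, as in the random-coding argument, that the codebook ensemble treats all messages symmetrically):
$$P(\hat W\ne W)=\sum_{m=1}^{M}\frac1M\,P(\hat W\ne m\mid W=m)=P(\hat W\ne 1\mid W=1),$$
where the first equality uses the uniform prior and the second uses that, averaged over the random codebook, every conditional error probability is the same.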
There is a similar question, but the answers there are either incomplete or not rigorous.
Here are several possible explanations.
For a given rate, the optimal maximum probability of error increases as the entropy of the source increases, so it would suffice to prove the theorem when the source entropy is maximized, i.e., for the uniform distribution. I am not sure whether this is true and cannot find a proof.
We only index the typical sequences of the source and transmit these indices over the channel. Since all sequences in the typical set are roughly equiprobable, we may assume a uniform distribution in the theorem. I don't think this analysis is rigorous. The "equiprobability" is based on $p(\mathbf X)\approx 2^{-nH}$, obtained from $\left|-\frac1n \log p(\mathbf X)-H\right|<\varepsilon$ for sufficiently large $n$. However, for a given $\varepsilon>0$ this only yields $2^{-n(H+\varepsilon)}<p(\mathbf X)<2^{-n(H-\varepsilon)}$, and the ratio of the upper to the lower bound, $2^{2n\varepsilon}$, tends to $\infty$ as $n\to\infty$. Therefore I don't see how "equiprobability" can apply when $n$ is large.
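As a quick numeric check of the ratio in question (a throwaway sketch; `bound_ratio` is just a name I made up for the quantity $2^{2n\varepsilon}$):

```python
# Numeric check of the objection above: the AEP only pins typical-set
# probabilities to the window (2^{-n(H+eps)}, 2^{-n(H-eps)}), so the
# ratio of the largest to the smallest admissible probability is
# 2^{2 n eps}, which blows up with n for any fixed eps > 0.
def bound_ratio(n: int, eps: float) -> float:
    """Upper/lower bound ratio 2^{2 n eps} on probabilities in the typical set."""
    return 2.0 ** (2 * n * eps)

for n in (100, 500, 1000):
    print(n, bound_ratio(n, eps=0.01))
```

Even for $\varepsilon=0.01$, by $n=1000$ the most likely typical sequence may already be about a million times more probable than the least likely one.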
If the input message is not uniformly distributed, one can first compress the raw message bits into a shorter sequence of independent, uniformly distributed bits, by the source coding theorem, incurring only a negligible error in reconstruction. However, I don't understand how to compress in such a way.
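This is not the information-theoretic construction, but as an empirical illustration of the idea: the output of a good compressor looks nearly uniform even when the source is heavily biased, since any residual bias would itself be compressible (the helper name `ones_fraction` is mine):

```python
import random
import zlib

def ones_fraction(data: bytes) -> float:
    """Fraction of 1-bits in a byte string."""
    ones = sum(bin(b).count("1") for b in data)
    return ones / (8 * len(data))

rng = random.Random(0)
# Biased Bernoulli(0.9) source, packed 8 bits per byte.
bits = [1 if rng.random() < 0.9 else 0 for _ in range(80_000)]
raw = bytes(
    sum(bit << k for k, bit in enumerate(bits[i : i + 8]))
    for i in range(0, len(bits), 8)
)
compressed = zlib.compress(raw, level=9)

print(f"raw ones fraction:        {ones_fraction(raw):.3f}")
print(f"compressed ones fraction: {ones_fraction(compressed):.3f}")
```

The raw stream has roughly 90% ones, while the compressed stream's bit frequency sits much closer to one half.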
The rate–distortion function. I don't know how to use it in a proof.
Any help is appreciated.
Of course, in general average error control is distinct from maximum error control (i.e., bounds on $\max_m P(\mathrm{err}\mid M = m)$). For example, you can have a code that simply never recovers some message, in which case the maximum probability of error is close to $1$ even though the average over the messages is small. One of the early papers dealing with the maximal probability of error is by Feinstein [1], which takes the maximal probability of error as the controlled quantity and shows, again, that $R < C$ suffices to communicate for large $n$. The language is somewhat archaic, but see Section II both to verify that the definition deals with maximal error and to see that the result is the same. A better source, perhaps, is Csiszár and Körner's textbook [2]. Their chapter on channel coding explicitly separates the average and maximum probabilities of error, and deals mostly with the latter when proving results. The main point I want to make is that yes, this is a legitimate worry, but it has been dealt with, and essentially only requires a slightly more refined approach to the basic concepts.
That said, I do want to say that justification 2 is not so unreasonable, and your critique of it is incomplete. The random coding proof typically shows that if $R < C - \varepsilon$, then the block probability of error is $\exp(-n\delta(\varepsilon))$, where $\delta$ is typically linear in $\varepsilon$. So, to argue that under approximate uniformity average error control yields maximum error control, one only needs to choose $\varepsilon_{\mathrm{cx}}$ and $\varepsilon_{\mathrm{tx}}$ (where $\mathrm{cx}$ denotes compression and $\mathrm{tx}$ denotes transmission) together so that $\exp(-n(\delta(\varepsilon_{\mathrm{tx}}) - \varepsilon_{\mathrm{cx}})) \to 0$, which is a matter of playing with the $\varepsilon$s, of course. In fact it isn't even necessary that $n_{\mathrm{tx}} = n_{\mathrm{cx}}$ (indeed, depending on the alphabets, their ratio is sometimes how the rate is defined), so there's even more leeway.
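To spell out the bookkeeping (my reading of the argument, with base-$2$ exponentials): suppose the compression stage guarantees $p(m)\le 2^{-n(R-\varepsilon_{\mathrm{cx}})}$ for every message $m$ it can emit, while the channel code has average error $\frac1{2^{nR}}\sum_m P(\mathrm{err}\mid m)\le 2^{-n\delta(\varepsilon_{\mathrm{tx}})}$. Then the error probability under the true message distribution satisfies
$$\sum_m p(m)\,P(\mathrm{err}\mid m)\le 2^{n\varepsilon_{\mathrm{cx}}}\cdot\frac1{2^{nR}}\sum_m P(\mathrm{err}\mid m)\le 2^{-n(\delta(\varepsilon_{\mathrm{tx}})-\varepsilon_{\mathrm{cx}})},$$
which vanishes once $\varepsilon_{\mathrm{cx}}<\delta(\varepsilon_{\mathrm{tx}})$.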
[1]: Feinstein, A. (1954). A new basic theorem of information theory. IRE Trans. Inform. Theory, vol. 4, no. 4, pp. 2–22.
[2]: Csiszár, I., & Körner, J. (2011). Information Theory: Coding Theorems for Discrete Memoryless Systems (2nd ed.). Cambridge: Cambridge University Press. doi:10.1017/CBO9780511921889