In an information-theoretic context, we have some set of messages $M$ which are transmitted across a noisy channel. Cover & Thomas (specifically in the proof of Theorem 7.12.1) assume, without explanation, that the messages to be transmitted across the channel are distributed uniformly over $M$.
This assumption has been eating away at me - is there any explanation for why the authors do this?
Justification for paying particular attention to the uniformly distributed case can be made in a number of ways. I will provide an overview of two general arguments below. T. M. Cover also gives a brief explanation (in the first paragraph of p. 195) of why the case where $W$ is uniformly distributed over the possible indices is of particular interest.
One way to look at this is to consider what we ultimately need to do here. We need to analyse the probability of error when a channel of rate $R$ is used to transmit some source. The source can have any entropy $H \in [0, R]$. Let $P_{\epsilon, H}$ denote the minimum achievable probability of error when a source with entropy $H$ is transmitted over the rate-$R$ channel. When $H = 0$ it is trivial to see that the minimum achievable probability of error is also $0$, and when $H = R$ we denote the minimum achievable probability of error by $P_{\epsilon, R}$. For every value of entropy between these limits, the minimum achievable probability of error lies between $0$ and $P_{\epsilon, R}$ (this is proved separately in the chapter on rate-distortion theory). We can therefore choose to analyse the probability of error when the channel is used to transmit a source with entropy $H(X) = R$, because for any other valid value of $H$ the minimum achievable probability of error is upper bounded by $P_{\epsilon, R}$.
It follows that if we can prove that some particular probability of error $P_e$ is achievable when $H(X) = R$, then the minimum achievable probability of error in that case is at most $P_e$, i.e. $P_{\epsilon, R} \leq P_e$. And since $P_{\epsilon, R}$ in turn upper bounds every $P_{\epsilon, H}$, $P_e$ upper bounds the minimum achievable probability of error for every valid value of $H(X)$. If we take $P_e$ to be the probability of error achieved by uniformly indexing the elements of the typical set, then it can be shown that $P_e \rightarrow 0$ as $n \rightarrow \infty$, which proves that the minimum achievable probability of error tends to $0$ for every valid value of $H(X)$.
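Put symbolically, this first argument is just the chain of bounds

$$P_{\epsilon, H} \;\leq\; P_{\epsilon, R} \;\leq\; P_e \;\longrightarrow\; 0 \quad \text{as } n \rightarrow \infty, \qquad \text{for every } H \in [0, R].$$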
Recall that if a random variable $X$ following some distribution $p_X$ (i.e. $X \sim p_X$) is sampled $n$ times, then as $n \rightarrow \infty$ the probability that the resulting sequence is a member of the typical set of $p_X$, $A_{\epsilon, X}$, is at least $1 - \epsilon$. Since the typical set contains approximately $2^{nH(X)}$ elements and each of them is approximately equiprobable, the probability of occurrence of any particular sequence in $A_{\epsilon, X}$ is approximately $(1 - \epsilon)2^{-nH(X)} \approx 2^{-nH(X)}$.
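These facts are easy to check numerically. Below is a minimal sketch assuming an i.i.d. Bernoulli($p$) source; the parameter values (`p = 0.2`, `n = 20`, `eps = 0.15`) are my own illustrative choices, and at such a small $n$ the probability of the typical set is still well below $1$:

```python
import itertools
import math

p, n, eps = 0.2, 20, 0.15
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # source entropy in bits

def prob(seq):
    """Probability of one binary sequence under an i.i.d. Bernoulli(p) source."""
    k = sum(seq)
    return p ** k * (1 - p) ** (len(seq) - k)

# A sequence is eps-typical iff 2^{-n(H+eps)} <= p(seq) <= 2^{-n(H-eps)}.
lo, hi = 2 ** (-n * (H + eps)), 2 ** (-n * (H - eps))
typical = [s for s in itertools.product((0, 1), repeat=n) if lo <= prob(s) <= hi]

print(f"H(X)            = {H:.3f} bits")
print(f"|A_eps|         = {len(typical)}  vs  2^(nH) ~ {2 ** (n * H):.0f}")
print(f"Pr(X^n typical) = {sum(prob(s) for s in typical):.3f}  (-> 1 as n -> infinity)")
```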
What follows from this is that if you need to transmit $n$ samples of some $X$ over a channel, the transmission can be done with an arbitrarily small probability of error $P_e$ by sending only the index $W$ of the observed sequence within $A_{\epsilon, X}$. The index will be (approximately) uniformly distributed over the set of indices $\{1, 2, ..., 2^{nH(X)}\}$ because there are only $2^{nH(X)}$ approximately equiprobable sequences in $A_{\epsilon, X}$.
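Continuing the sketch above (this reuses `typical`, `p` and `n` from the previous snippet), here is a hypothetical encoder/decoder pair that transmits only the index of the observed typical sequence; the names `encode`/`decode` and the error-signalling convention are my own, not the book's:

```python
import random

# Both sides agree on an enumeration of A_eps; only the index W is sent.
# (Indices here start at 0 rather than 1, which changes nothing.)
index_of = {s: w for w, s in enumerate(typical)}

def encode(seq):
    # Returns the index W of a typical sequence, or None when the sequence
    # is atypical: the error event, whose probability vanishes as n grows.
    return index_of.get(tuple(seq))

def decode(w):
    return typical[w]

# One round trip with a random source output:
sample = tuple(random.choices((0, 1), weights=(1 - p, p), k=n))
w = encode(sample)
print("index W:", w, "| recovered:", w is not None and decode(w) == sample)
```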
Note that the entropy of the index $W$ is $nH(X)$ bits, so in order to successfully transmit this source over a channel you need a channel capable of carrying at least $nH(X)$ bits for every $n$ samples output by the source, or in other words a channel that can carry at least $H(X)$ bits per source sample (i.e. $R \geq H(X)$).
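To make this bound concrete with the made-up Bernoulli($0.2$) source from the sketches above: $H(X) \approx 0.722$ bits, so indexing a block of $n = 20$ samples takes $\lceil nH(X) \rceil = 15$ bits, i.e. the channel must carry at least $0.722$ bits per source sample.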
Another argument is to consider two rates $R_1$ and $R_2$, where $R_2 > R_1$ and $R_1 = H(X)$. Let $P_{\epsilon, R_1}$ be the minimum achievable probability of error (i.e. the distortion) when a channel with rate $R_1$ is used to transmit a source of entropy $H(X)$ for a fixed $n$. Then $D(R_1) = P_{\epsilon, R_1}$, where $D(R)$ is the distortion-rate function, and similarly $D(R_2) = P_{\epsilon, R_2}$.
If $R_2 > R_1$ then $D(R_2) < D(R_1)$ because $D(R)$ is strictly decreasing (this is proved separately in the chapter on rate-distortion theory). What follows from this is that if $R_1 = H(X)$ and $D(R_1) = P_{\epsilon, R_1}$, then for all rates $R_i > R_1 = H(X)$ we have $D(R_i) = P_{\epsilon, R_i} < P_{\epsilon, R_1}$.
$P_{\epsilon, R_1}$ thus acts as an upper bound on the distortion, so proving that $P_{\epsilon, R_1} \rightarrow 0$ is sufficient to prove that the minimum achievable probability of error tends to $0$ for all $R_i > R_1 = H(X)$. And at the point where $R = R_1 = H(X)$, the index $W$ is uniformly distributed over (the indices of) the typical set.
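Put symbolically, this second argument is the chain

$$D(R_i) \;<\; D(R_1) \;=\; P_{\epsilon, R_1} \;\longrightarrow\; 0 \qquad \text{for every } R_i > R_1 = H(X).$$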