Shannon's source-coding theorem (the foundation of Rissanen's Minimum Description Length principle) shows that, for a fixed alphabet $A$ of $n$ symbols with probabilities $\{p_1, \dots, p_n\}$, one can construct a set of $n$ binary prefix codewords, assigning shorter codewords to more frequent symbols, such that a randomly chosen sequence of these symbols can be encoded with an expected per-symbol length $L(A)$ satisfying
$-\sum_{i}{p_i \log_2(p_i)} \le L(A) < -\sum_{i}{p_i \log_2(p_i)} + 1$
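A concrete construction achieving this bound is Huffman coding. Here is a minimal sketch (the four-symbol distribution is an arbitrary example; for dyadic probabilities like these, the expected length meets the entropy exactly):

```python
import heapq
from math import log2

def huffman_lengths(probs):
    """Build a Huffman prefix code; return {symbol_index: codeword_length}."""
    # Heap entries: (probability, unique tie-breaker, [(symbol, depth), ...])
    heap = [(p, i, [(i, 0)]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf one level deeper.
        merged = [(sym, d + 1) for sym, d in a + b]
        heapq.heappush(heap, (p1 + p2, tie, merged))
        tie += 1
    return dict(heap[0][2])

probs = [0.5, 0.25, 0.125, 0.125]
lengths = huffman_lengths(probs)
H = -sum(p * log2(p) for p in probs)                  # entropy, 1.75 bits
L = sum(p * lengths[i] for i, p in enumerate(probs))  # expected length, 1.75 bits
```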
But suppose that, at each timestep in the sequence, we want to allow not just the $n$ individual symbols but also ambiguous communications. For instance: "at timestep $t$, we send 'either symbol 3, 5, or 6'". The total probability of that message would be $p_{amb} = p_3 + p_5 + p_6$, and intuitively the "information content" of such a communication ought to equal that of communicating a single symbol with that same probability $p_{amb}$.
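To make that intuition concrete, a small numeric sketch (the six-symbol distribution is hypothetical): the surprisal of the ambiguous message "either 3, 5, or 6" is just $-\log_2 p_{amb}$, exactly what a single symbol of that probability would carry.

```python
from math import log2

# Hypothetical distribution over symbols 1..6.
p = {1: 0.3, 2: 0.2, 3: 0.2, 4: 0.1, 5: 0.1, 6: 0.1}

p_amb = p[3] + p[5] + p[6]   # probability of "either 3, 5, or 6" -> 0.4
info_amb = -log2(p_amb)      # surprisal of the ambiguous message, in bits
```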
However, there doesn't seem to be any way to actually construct a code that allows for ambiguous combinations of more than one symbol. Is the construction of a code just a semantic device to help make the concept of information more concrete, or is it necessary for the theory?
If a random variable $X$ picks symbols randomly and uniformly from an alphabet $\mathcal{X}$ then the entropy per symbol is $\log_2(|\mathcal{X}|)$ where $|\mathcal{X}|$ denotes the number of symbols in the set $\mathcal{X}$.
If another random variable $Y$ consists of noisy observations of $X$, then $Y$ is said to observe $X$ through a noisy channel. The average number of information bits transmitted from $X$ to $Y$ per symbol is called the mutual information and is denoted $I(X;Y)$. It depends both on the distribution of the noisy channel and on the input distribution of $X$.
\begin{align} I(X;Y) &= (\text{Entropy of X}) - (\text{Entropy of X given Y}) \\ &= H(X) - H(X|Y)\\ \end{align}
Now if X is uniformly distributed then
$$ I(X;Y) = \log_2|\mathcal{X}| - (\text{Entropy of X given Y}) $$
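As a sanity check, $I(X;Y) = H(X) - H(X\mid Y)$ can be computed directly from a joint distribution. The channel below, a binary symmetric channel with crossover probability $0.1$ and uniform input, is purely an illustrative assumption:

```python
from math import log2

def entropy(dist):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * log2(p) for p in dist if p > 0)

n, eps = 2, 0.1
pX = [1 / n] * n                           # uniform input distribution
pYgX = [[1 - eps, eps], [eps, 1 - eps]]    # binary symmetric channel p(y|x)
joint = [[pX[x] * pYgX[x][y] for y in range(n)] for x in range(n)]
pY = [sum(joint[x][y] for x in range(n)) for y in range(n)]

HX = entropy(pX)                           # log2(2) = 1 bit
# H(X|Y) = sum_y p(y) * H(X | Y = y), from the posterior p(x|y)
HXgY = sum(pY[y] * entropy([joint[x][y] / pY[y] for x in range(n)])
           for y in range(n))
I = HX - HXgY                              # about 0.531 bits per symbol
```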
The entropy of $X$ given $Y$ is the number of information bits you still lack about $X$ after receiving the noisy symbol $Y$. Let $\mathcal{X}_s$ be some subset of $\mathcal{X}$. Suppose that, after receiving the noisy observation $y$, the receiver knows that the symbol $x$ is a member of $\mathcal{X}_s$, and also knows, for each symbol $x_s \in \mathcal{X}_s$, the probability $p(x_s = x)$ that the observed symbol was $x_s$. Then the entropy of $X$ given $Y$ can be shown to be the entropy of $\mathcal{X}_s$ under the distribution $p_{X_s}$.
If you only had access to the noisy observation $y$, and another receiver had access both to that same noisy observation and to the true observed value $x$, then all that receiver needs to send you, for you to resolve $x$, is which symbol $x_s \in \mathcal{X}_s$ satisfies $x = x_s$. The minimum expected number of bits needed for this is, by definition, the entropy of the set $\mathcal{X}_s$ under the distribution $p_{X_s}$.
If $x_s$ is uniformly distributed in $\mathcal{X}_s$ then the entropy of $X_s$ is $\log |\mathcal{X}_s|$, but if the distribution over $\mathcal{X}_s$ is not uniform then the entropy of $X_s$ is strictly less than $\log |\mathcal{X}_s|$, because the uniform distribution maximises entropy.
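A quick numeric illustration of that last point (the skewed distribution is an arbitrary example; any non-uniform choice works):

```python
from math import log2

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in dist if p > 0)

uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]

H_uniform = entropy(uniform)   # equals log2(4) = 2 bits
H_skewed = entropy(skewed)     # strictly less than 2 bits
```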
\begin{align} (\text{Entropy of X given Y}) &= (\text{Entropy of } X_s) \\ &\leq \log_2 |\mathcal{X_s}| \end{align}
We can thus deduce that for uniformly distributed $X$ we have
\begin{align} I(X;Y) &= (\text{Entropy of X}) - (\text{Entropy of X given Y}) \\ &= H(X) - H(X | Y) \\ &= \log |\mathcal{X}| - H(X | Y) \\ &\geq \log |\mathcal{X}| - \log |\mathcal{X}_s| \end{align}
In the 8-sided die case you outlined in the comments, $X$ is uniformly distributed over a set $\mathcal{X}$ with $|\mathcal{X}| = 8$. After receiving the noisy variable $Y$ you know that $x$ belongs to a subset $\mathcal{X}_s$ of $\mathcal{X}$ with $|\mathcal{X}_s| = 2$, so $I(X;Y) \geq \log 8 - \log 2 = 2$.
Examples:
i) If, after receiving the noisy variable $Y$, you know that the observed symbol $x$ was one of $N$ equally likely symbols, then $I(X;Y) = \log |\mathcal{X}| - H(X_s) = \log |\mathcal{X}| - \log |\mathcal{X}_s| = \log 8 - \log N$.
ii) If, after receiving the noisy variable $Y$, you know that the observed symbol $x$ was one of $N$ symbols, with probability $p_i$ that $x$ is the $i^{th}$ of those $N$ symbols, then $I(X;Y) = \log |\mathcal{X}| - H(X_s) = \log |\mathcal{X}| - \sum_{i=1}^{N} p_i \log \frac{1}{p_i}$.
iii) If, after receiving the noisy variable $Y$, you know that the observed symbol was exactly $x$, then $I(X;Y) = \log |\mathcal{X}| - H(X_s) = \log |\mathcal{X}| - \log |\mathcal{X}_s| = \log 8 - \log 1 = 3$.
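The three cases can be checked numerically with the same entropy formula (the posteriors in case ii are made-up values for illustration):

```python
from math import log2

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in dist if p > 0)

H_X = log2(8)                        # uniform 8-symbol alphabet: 3 bits

# i) N = 4 equally likely candidates remain: I = log 8 - log 4 = 1 bit
I_i = H_X - entropy([0.25] * 4)

# ii) N = 2 candidates remain with unequal posteriors (hypothetical values)
I_ii = H_X - entropy([0.9, 0.1])

# iii) x is known exactly: residual entropy is 0, so I = 3 bits
I_iii = H_X - entropy([1.0])
```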