I've often seen Shannon entropy explained as "the average number of yes/no questions needed to find out which state a given system is in."
This works very well for a system with a power-of-two number of possible states, but if I try it with, say, 3 equiprobable states, I don't find it to be true. Using the entropy, $H = -\sum_{n = 1}^{3}\frac{1}{3}\log_2\left(\frac{1}{3}\right) \approx 1.58$, I find an average of 1.58 yes/no questions. But if I encode the $1^{st}$ state as 0, the $2^{nd}$ state as 1 and the $3^{rd}$ as 00, the average number of bits (questions) I need to characterize the system is $\frac{1+1+2}{3}=\frac{4}{3}\approx 1.33$.
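These numbers can be checked directly in Python (a quick sketch, assuming 3 equiprobable states):

```python
from math import log2

# Entropy of 3 equiprobable states: H = -sum p*log2(p) with p = 1/3
p = 1 / 3
H = -3 * p * log2(p)  # ≈ 1.585 bits

# Average length of the proposed encoding 0, 1, 00 (one symbol each
# of length 1, 1 and 2, all equally likely)
avg_proposed = (1 + 1 + 2) / 3  # ≈ 1.333 bits, below H
```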
Did I make a mistake somewhere, or does it mean that this explanation with the yes/no questions is just a way to simplify things?
First, your proposed encoding $A\to 0$, $B \to 1$, $C\to 00$ attains a coding length below the entropy, which should be impossible.
The problem is that your encoding is practically useless, because it's not "uniquely decodable": if you receive $00$ you cannot know if the input was $AA$ or $C$.
Instead you need something like $A\to 0$, $B \to 10$, $C\to 11$, which has an average coding length of $5/3 \approx 1.67$, above the entropy ($H \approx 1.585$). This is to be expected.
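To see why a prefix code like this one works where $0, 1, 00$ fails, here is a small sketch: a greedy decoder that matches codewords at the front of the bit stream, plus a check that every two-symbol message round-trips uniquely (the code map and helper are just illustrations, not from the original post):

```python
from itertools import product

# Prefix code from the answer: A -> 0, B -> 10, C -> 11
code = {"A": "0", "B": "10", "C": "11"}

def decode(bits, code):
    """Greedy prefix decoding: no codeword is a prefix of another,
    so as soon as the buffer matches a codeword we can emit it."""
    inv = {v: k for k, v in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inv:
            out.append(inv[buf])
            buf = ""
    assert buf == "", "leftover bits: not a valid codeword sequence"
    return "".join(out)

# Every 2-symbol message decodes back to itself, unambiguously
for pair in product("ABC", repeat=2):
    msg = "".join(pair)
    assert decode("".join(code[s] for s in msg), code) == msg

# Average length: (1 + 2 + 2) / 3 = 5/3 ≈ 1.667 bits, just above H ≈ 1.585
avg = sum(len(w) for w in code.values()) / 3
```

With the non-decodable code $0, 1, 00$, the same greedy decoder would read $00$ as $AA$, with no way to recover $C$.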
If you code each value in isolation, then, yes, you cannot do better than that. But you can code several values together (code the "extension of the source") and approach the entropy.
Consider for example a group of $n=5$ values. There are $3^5=243$ equally probable joint values. Because $243 < 256 = 2^8$, you can code this group with $8$ yes/no questions.
Hence, with this simple scheme, you need $8$ questions to discover $5$ values, which gives a coding length of $8/5=1.6$ bits per value, close to the entropy.
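This block-coding idea can be sketched as follows: treat a block of 5 ternary symbols as one base-3 number in $[0, 243)$ and write it in 8 bits (the function names are just for illustration):

```python
# Block-code n = 5 ternary symbols into a single 8-bit string.
n = 5
symbols = "ABC"

def encode_block(block):
    """Read the block as a base-3 number, then write it in 8 bits
    (valid because 3**5 = 243 <= 256 = 2**8)."""
    idx = 0
    for s in block:
        idx = idx * 3 + symbols.index(s)
    return format(idx, "08b")

def decode_block(bits):
    """Invert encode_block: peel off base-3 digits, least significant first."""
    idx = int(bits, 2)
    out = []
    for _ in range(n):
        idx, r = divmod(idx, 3)
        out.append(symbols[r])
    return "".join(reversed(out))

assert decode_block(encode_block("ACBCA")) == "ACBCA"
rate = 8 / n  # 1.6 bits per symbol, versus entropy ≈ 1.585
```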
The above is not optimal (see Huffman coding), but you get the idea.