To calculate the entropy for an event of a uniform distribution of this kind:
- Events: "red" or "blue"
- Probabilities: P(red) = 0.5, P(blue) = 0.5
We just do $e = -\log_2(0.5)$
In this case it yields 1, which I guess means we need one bit.
If we have 4 values with probability 0.25 each, we need 2 bits. No surprises here.
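As a sanity check, here is a small sketch (not from the original post) computing the Shannon entropy for these uniform cases; the helper name `entropy` is my own:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # two equally likely events -> 1.0
print(entropy([0.25] * 4))   # four equally likely events -> 2.0
```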
However, let's say we have this non-uniform distribution:
- Events: "red" or "blue"
- Probabilities: P(red) = 0.125, P(blue) = 0.875
Now, "red" should be represented with 3 bits. Is this correct? Why don't we still use just 1 bit?
No. You could say that "red" carries 3 bits of information, and that an "ideal" code would assign 3 bits to "red" and about 0.19 bits to "blue", so that on average you use 0.54 bits, which is the entropy. But of course you can't assign a fractional number of bits to a symbol. The optimal code is still 1 bit for each symbol.
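A quick sketch of where those numbers come from (my own worked arithmetic, assuming the probabilities given in the question):

```python
import math

p_red, p_blue = 0.125, 0.875
info_red = -math.log2(p_red)    # 3.0 bits of information for "red"
info_blue = -math.log2(p_blue)  # ~0.193 bits for "blue"

# Entropy = probability-weighted average of the information content
H = p_red * info_red + p_blue * info_blue
print(H)  # ~0.5436 bits per symbol
```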
To attain the entropy (and use fewer bits for blue, or blueish, runs of symbols) you need to code the extensions of the source, i.e. blocks of several symbols at a time. See eg here.
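To illustrate the extension idea, here is a sketch (my own, not from the answer) that builds a Huffman code over length-`n` blocks of the source and shows the average bits per symbol dropping from 1 toward the entropy of about 0.5436 as `n` grows:

```python
import heapq
import itertools
import math

def huffman_lengths(probs):
    """Code lengths of an optimal prefix code, via Huffman's algorithm."""
    # Heap entries: (probability, tiebreak, indices of symbols under this node).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths

p = {"red": 0.125, "blue": 0.875}
avgs = []
for n in (1, 2, 3, 4):
    # n-th extension of the source: all length-n blocks, product probabilities.
    blocks = list(itertools.product(p, repeat=n))
    probs = [math.prod(p[s] for s in b) for b in blocks]
    lengths = huffman_lengths(probs)
    avg = sum(q * L for q, L in zip(probs, lengths)) / n
    avgs.append(avg)
    print(n, round(avg, 4))  # n=1 gives 1.0, n=2 gives 0.6797, then lower
```

The per-symbol cost can never go below the entropy, but coding longer blocks gets arbitrarily close to it.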