I'm familiar with information and coding theory, and I know that the unit of the Shannon information content $-\log_2(P(A))$ is the "bit", where a "bit" is a "binary digit", or a "storage device that has two stable states".
But can someone rigorously prove that the unit is actually the "bit"? Or should we just accept it as a definition and justify it with coding examples?
Thanks!
When there are two equally likely events, conveying the news that one of them has actually happened conveys $-\log_2(0.5) = 1$ bit of information.
There is no rigorous proof here: just a mapping from a probability space to a bit-space (if I may call it that). Either you see a binary random variable with equally likely outcomes 0 and 1, or you consider two equally likely stable states, whose representation requires exactly 1 bit. The unit is fixed by the base of the logarithm: base 2 gives bits, base $e$ gives nats, and base 10 gives hartleys. "Bit" is the natural name in base 2 precisely because identifying one of $2^n$ equally likely outcomes takes $n$ binary digits.
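Not a proof, just a quick numeric sanity check of that mapping (a minimal Python sketch; the helper name `self_information_bits` is my own): the information content of one of $2^n$ equally likely outcomes coincides with the number of binary digits needed to label that outcome.

```python
import math

def self_information_bits(p: float) -> float:
    """Shannon information content of an event with probability p, in bits."""
    return -math.log2(p)

# Two equally likely events: learning which one occurred conveys 1 bit.
print(self_information_bits(0.5))  # 1.0

# More generally, one of 2**n equally likely outcomes carries n bits of
# information, and n binary digits are exactly what you need to label it.
for n in range(1, 5):
    outcomes = 2 ** n
    info = self_information_bits(1 / outcomes)
    label_len = len(format(outcomes - 1, "b"))  # binary digits to label every outcome
    print(f"{outcomes} outcomes: {info:.1f} bits, "
          f"{label_len} binary digits per label")
```

The agreement between the two columns is the "coding example" justification: the base-2 logarithm counts binary digits, so the unit gets the name "bit" by construction, not by theorem.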