Consider a random variable $X$ with:
$f(x;p) = 2^{-n}$ if $x = 1$ and $f(x;p) = 1-2^{-n}$ if $x = 0$.
Then the information gained from an experiment where x=1 is discovered is:
$I(p) = -\log{2^{-n}} = n$ bits of information
and the information gained if x=0 is the outcome:
$I(1-p) = -\log{(1-2^{-n})} \approx -[(1-2^{-n})-1] = 2^{-n}$ bits (using the leading term of the Taylor expansion of $\log(x)$ about $x = 1$).
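As a sanity check (not part of the original post), the two surprisals can be computed directly; $n = 8$ is an illustrative choice:

```python
import math

# Surprisal of each outcome for the two-point distribution above, n = 8.
n = 8
p = 2.0 ** -n                      # P(x = 1), the unlikely outcome

I_unlikely = -math.log2(p)         # surprisal of x = 1: exactly n bits
I_likely = -math.log2(1.0 - p)     # surprisal of x = 0: roughly 2^-n / ln 2 bits

print(I_unlikely)  # 8.0
print(I_likely)    # a tiny fraction of a bit
```

The rare outcome carries exactly $n$ bits, while the common outcome carries almost none, which is the asymmetry the question is about.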
I'm a bit confused. I thought that you gained a single bit of information every time you eliminated half the hypotheses (David MacKay), but in this case there are only two (0 or 1) yet if (for n=256 for example) you get x=1 as the outcome you gain 256 bits of information, and a 0 outcome gives you almost no information.
How is the unlikely outcome more surprising than the likely one, and why do you gain so much more information?
To clarify I understand that the (binary) logarithm is a measure of information, as it is additive over independent rvs and that it allows a single bit to be gained every time half the hypotheses of an experiment are eliminated (and that it measures the size of a file which encodes X). I just don't understand this example.
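The additivity property mentioned here can be checked numerically (a sketch with illustrative probabilities, not from the post):

```python
import math

# Surprisal in bits.
I = lambda prob: -math.log2(prob)

# For independent events the joint probability multiplies,
# so the surprisals add: I(p * q) == I(p) + I(q).
p, q = 0.5, 0.25
print(I(p * q))      # 3.0
print(I(p) + I(q))   # 3.0
```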
asked 1 hour ago by Tom Kealy
I'm wrong in what I posted up there, but I couldn't quite figure out the algebra. Suffice it to say, you don't need any restriction on the form of $p$: a Taylor-series argument will do nicely.
So, the information contained in the likely event:
$I(1-p) = -\log(1-p) \approx -[(1-p)-1] = p$, using the Taylor expansion of $\log(x)$ about $x = 1$. That is, when the likely event happens you gain only about $p$ bits of information, which is almost nothing; nearly all of the potential information is carried by the unlikely event. Gaining exactly one bit per outcome happens only when the outcome eliminates half of a set of equally likely hypotheses.
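The first-order Taylor approximation used in this step can be verified numerically (a sketch; the probability value is illustrative):

```python
import math

# Near x = 1, -ln(x) is approximately 1 - x, so the surprisal of a
# very likely event is close to the probability of its complement.
x = 1 - 2.0 ** -8          # probability of the likely outcome
exact_nats = -math.log(x)  # exact surprisal in nats
approx_nats = 1 - x        # leading Taylor term

print(exact_nats, approx_nats)      # nearly equal
print(exact_nats / math.log(2))     # the same quantity converted to bits
```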
You would gain a single bit of information for eliminating half the hypotheses only when all the hypotheses are equally likely. In your case, the two outcomes are far from equally likely for values of $n \geq 2$.
Looking at just $I(p)$ as a measure of "information content" is not entirely correct. According to the noiseless channel coding theorem (also called the source coding theorem), the entropy $H(p)$ gives the average number of bits required per source symbol to efficiently encode a stream of data drawn from your distribution. In this case, $H(p) = -p\log p - (1-p)\log(1-p)$. Since $\log p$ is weighted by $p$, the "information contribution" of the symbol $x=1$ is not exponentially greater than that of $x=0$.
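To make the averaging concrete, here is a short sketch (the value $n = 8$ is illustrative) computing $H(p)$ for the distribution in the question:

```python
import math

def binary_entropy(p):
    """Entropy in bits of a {0, 1}-valued variable with P(x=1) = p."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

n = 8
p = 2.0 ** -n
# The n-bit surprisal of x = 1 is weighted by its tiny probability 2^-n,
# so the average information per symbol is far below one bit.
print(binary_entropy(p))
```

This is why a long stream of samples from this distribution compresses to far fewer than one bit per symbol, even though the occasional $x = 1$ is individually very surprising.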