I am trying to understand how the formula I(x) = -log(p(x)) for self-information was derived.
From what I have read, two constraints were imposed on the properties that we would like self-information to satisfy. These constraints are listed below:

- I(x) < I(y) if p(x) > p(y)
- I(x and y) = I(x) + I(y) if p(x and y) = p(x) p(y) (i.e. x and y are independent)
Following this we somehow find out that I(x) = -log(p(x)) satisfies the above requirements.
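As a quick numeric sanity check (my own sketch, in Python, using base-2 logarithms), we can confirm that the candidate I(p) = -log(p) does satisfy both constraints for some sample probabilities:

```python
import math

def I(p):
    """Candidate self-information of an event with probability p (base 2)."""
    return -math.log2(p)

p, q = 0.25, 0.5

# Constraint 1: the less probable event carries more information.
assert I(0.1) > I(0.9)

# Constraint 2: for independent events, information adds.
assert math.isclose(I(p * q), I(p) + I(q))

print(I(p), I(q), I(p * q))  # 2.0 1.0 3.0
```

Of course, a check on sample values shows only that -log(p) is *a* solution, not that it is the *only* one, which is exactly what the questions below ask.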
My exact questions are:

- Why did we define the requirements for self-information as above?
- How did we arrive at I(x) = -log(p(x))?
- How do we know that I(x) = -log(p(x)) uniquely satisfies the above requirements?
Just because they seem reasonable: they fit the concept we are trying to formalize. The first says that an event with low probability gives us a lot of information. The second says that the information from a pair of independent events is the sum of the individual informations.
Let's state our "axioms" in terms of a function $g$ of the probability alone, writing $I(x) = g(p(x))$:

1. $g(p) > g(q)$ if $p < q$ (monotonically decreasing)
2. $g(p\,q) = g(p) + g(q)$ (additivity for independent events)
It's easy to see that $g(p)= - \log_b (p)$ (arbitrary base $b>1$) satisfies those requirements. Proving uniqueness is a little more difficult. Sketch: suppose we have $n$ independent events, each with probability $p$; then the additivity axiom implies (by induction) $g(p^n)=n\, g(p)$. By considering that $g(p)=g([p^{1/m}]^m)=m\, g(p^{1/m})$ we deduce that $g(p^a)=a \, g(p)$ also holds for any rational $a$ — and, by continuity (which here follows from monotonicity), for any real $a$. Writing any $p \in (0,1)$ as $p = p_0^{\,a}$ for a fixed $p_0 \in (0,1)$ then gives $g(p) = a\, g(p_0) \propto -\log p$, which leads to the unique (up to the choice of base) solution $g(p)= - \log_b (p)$.
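The scaling identities in the sketch above can be checked numerically (my own illustration, in Python; the function name `g` matches the answer's notation):

```python
import math

def g(p):
    """The claimed unique solution, natural-log base."""
    return -math.log(p)

p = 0.3

# g(p^n) = n * g(p) for integer n (induction step of the sketch).
for n in range(1, 6):
    assert math.isclose(g(p ** n), n * g(p))

# g(p^(1/m)) = g(p) / m, hence g(p^a) = a * g(p) for rational a = k/m.
m, k = 7, 3
assert math.isclose(g(p ** (1 / m)), g(p) / m)
assert math.isclose(g(p ** (k / m)), (k / m) * g(p))
```

This does not replace the continuity argument — it only illustrates that the exponent-scaling property, which forces the logarithmic form, holds for the candidate solution.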