Clarifying Derivation of Entropy


I'm learning about probability from the book Pattern Recognition and Machine Learning by Christopher Bishop. It includes a justification for the definition of entropy that can be summarized as:

let $x$ and $y$ be independent events, that is $$p(x,y) = p(x) \cdot p(y)$$ and

$$h(x,y) = h(x) + h(y)$$ because entropy is designed to mean amount of surprise and independent events should have separate contributions of surprise.

The definition $$h(x,y) = -\log_2 p(x,y)$$ is one of a family of definitions that satisfy the properties we want (other logarithm bases are obvious choices, but maybe there are other functions with these properties).

However, Bishop doesn't write that equation for $h(x,y)$; he jumps right to saying $$h(x) = -\log_2 p(x)$$ It seems like, so far, the line of thought has been tied to joint distributions. Does probability theory define some sort of identity event $a$ such that $p(x,a) = p(x)$? Maybe $p(x,x) = p(x)$ is such a thing? Or is that not a sensible thing to think of, and I'm missing some notational point, or does entropy have some special connection to situations involving multiple events?
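A small numerical sketch of the additivity property may make the question concrete. The probabilities below are hypothetical, chosen only so the logarithms come out exact; `surprise` implements $h(p) = -\log_2 p$:

```python
import math

def surprise(p):
    """Information content ('surprise') of an event with probability p, in bits."""
    return -math.log2(p)

# Hypothetical probabilities for two independent events x and y.
p_x, p_y = 0.5, 0.25
p_xy = p_x * p_y  # independence: p(x, y) = p(x) * p(y)

# Additivity: h(x, y) = h(x) + h(y)
print(surprise(p_xy))                 # 3.0
print(surprise(p_x) + surprise(p_y))  # 1.0 + 2.0 = 3.0
```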

There are 2 answers below.

Define a random vector $Z$ which is a result of concatenating $X$ and $Y$. $H(X,Y) := H(Z)$.
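To illustrate this answer's construction: a sketch, with made-up distributions, where $Z$ ranges over pairs of outcomes of $X$ and $Y$, and (under independence) $H(Z) = H(X) + H(Y)$:

```python
import math
from itertools import product

def H(dist):
    """Shannon entropy (bits) of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical independent variables X and Y.
X = {'a': 0.5, 'b': 0.5}
Y = {'u': 0.25, 'v': 0.75}

# Z concatenates X and Y; independence gives p(z) = p(x) * p(y).
Z = {(x, y): X[x] * Y[y] for x, y in product(X, Y)}

print(H(Z))          # joint entropy H(X, Y), defined as H(Z)
print(H(X) + H(Y))   # equal to H(Z) because X and Y are independent
```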

---

Bishop expects you to fill in a crucial gap. Instead of taking $y$ to be a different variable, repeat the experiment so that $y$ is again $x$: the repetitions are independent and identically distributed.

Writing $h(p(x, x)) = h(p(x) \cdot p(x)) = h(p^2)$ expresses that the experiment has been repeated once. Once you look at it this way, the interpretation becomes the following straightforward one.

If you repeat the experiment, you will feel some surprise every time you see the outcome. The surprise at a given run does not depend on the surprise you felt at the outcome of the previous run. This is why you can write $h(p^2) = h(p \cdot p) = h(p) + h(p) = 2h(p)$.
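The repeated-experiment identity above can be checked numerically for any number of repetitions $n$, since seeing the same outcome $n$ times in a row has probability $p^n$. A minimal sketch, with an arbitrary choice of $p$:

```python
import math

def h(p):
    """Surprise of observing an outcome with probability p, in bits."""
    return -math.log2(p)

p = 0.3  # arbitrary example probability
# n independent repetitions of the same outcome have probability p**n,
# and the surprises add: h(p**n) = n * h(p).
for n in (1, 2, 5):
    assert math.isclose(h(p**n), n * h(p))
```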