Why is the conditional entropy defined as $H(Y\mid X) = \sum_{x\in X} p_X(x) H (Y\mid X = x)$

In the book "Elements of Information Theory", $H(Y\mid X)$ is defined this way, and it is then shown that

\begin{align*} H(Y\mid X) &= \sum_{x\in X} p_X(x) H (Y\mid X = x) \\ &= - \sum_{x\in X} \sum_{y \in Y} p(x,y) \log_2 p(y\mid x) \\ &= \mathbb{E}\big[ - \log_2 p(Y\mid X) \big] \\ &= - \mathbb{E}\big[\log_2 p(Y\mid X) \big] \end{align*}
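(For reference, the step from the first line to the second writes out $H(Y\mid X=x)$ as a sum over $y$ and uses $p_X(x)\,p(y\mid x) = p(x,y)$:

\begin{align*} \sum_{x\in X} p_X(x) H(Y\mid X=x) &= - \sum_{x\in X} p_X(x) \sum_{y\in Y} p(y\mid x) \log_2 p(y\mid x) \\ &= - \sum_{x\in X} \sum_{y\in Y} p(x,y) \log_2 p(y\mid x). \end{align*}
)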

I fail to understand why $H(Y\mid X)$ is actually "defined" like that. What's the justification for this definition?


I think I'm confused because

\begin{align*} H(Y,X) = - \sum_{x\in X} \sum_{y\in Y} p(x,y) \log_2 p(x,y) \end{align*}

but

\begin{align*} H(Y\mid X) \neq - \sum_{x\in X} \sum_{y\in Y} p(y\mid x) \log_2 p(y\mid x) \end{align*}

and I don't see why.
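One way to see the difference concretely is a small numerical check. The sketch below uses a made-up joint distribution on $X, Y \in \{0, 1\}$ (the numbers are arbitrary, chosen only for illustration) and compares the correct formula, which weights each $\log_2 p(y\mid x)$ term by the joint $p(x,y)$, against the tempting-but-wrong expression that weights by $p(y\mid x)$ itself:

```python
import math

# Hypothetical joint distribution p(x, y) over X = {0, 1}, Y = {0, 1}
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal p_X(x) and conditionals p(y | x) = p(x, y) / p_X(x)
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y_given_x = {(x, y): p_xy[(x, y)] / p_x[x] for (x, y) in p_xy}

# Correct conditional entropy: each term weighted by the JOINT p(x, y)
H_correct = -sum(p_xy[(x, y)] * math.log2(p_y_given_x[(x, y)])
                 for (x, y) in p_xy)

# The tempting-but-wrong expression: each term weighted by p(y | x) itself
H_wrong = -sum(p_y_given_x[(x, y)] * math.log2(p_y_given_x[(x, y)])
               for (x, y) in p_xy)

print(H_correct, H_wrong)  # the two values differ
```

For this distribution the correct value is about $0.846$ bits while the wrong expression gives about $1.693$; the latter is in fact the *unweighted* sum $\sum_x H(Y\mid X=x)$, since the conditionals $p(y\mid x)$ sum to $1$ for each $x$ separately, so summing over both $x$ and $y$ does not average over the outcomes of $X$.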

1 Answer


A random variable $Y$ has an entropy $H(Y)$, which gives the amount of information (in bits) required, on average, to describe the random variable. With the conditional entropy, we have some extra information, described by a random variable $X$, that is hopefully correlated with $Y$ in some way. So if, for example, the outcome $x_1 \in X$ occurs, we would know that some outcomes of $Y$ are more probable than others.

Let $x_i \in X$ denote the possible outcomes of $X$. Then $H(Y\mid x_i)$ is the average amount of information required to exactly pinpoint an event in $Y$, given that $x_i$ has occurred. Finally, you have one such quantity $H(Y\mid x_i)$ for each $x_i \in X$. Therefore, you should take their average, weighted by the probability of occurrence $p_X(x_i)$, to see how much information is required on average to pinpoint events of $Y$ when you have a stream of information from $X$. This weighted average is exactly the definition $\sum_i p_X(x_i)\, H(Y\mid X = x_i)$.
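The weighted-average reading above can be sanity-checked numerically. This sketch (using an arbitrary, made-up joint distribution) computes each per-outcome entropy $H(Y\mid X=x)$, averages them with weights $p_X(x)$, and confirms the result agrees with the chain rule $H(X,Y) = H(X) + H(Y\mid X)$:

```python
import math

# Hypothetical joint distribution p(x, y); any valid joint works here
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
xs = {x for x, _ in p_xy}
ys = {y for _, y in p_xy}

# Marginal p_X(x)
p_x = {x: sum(p_xy[(x, y)] for y in ys) for x in xs}

def H_Y_given_x(x):
    """Entropy of Y given X = x: average bits to pinpoint Y once x is known."""
    return -sum((p_xy[(x, y)] / p_x[x]) * math.log2(p_xy[(x, y)] / p_x[x])
                for y in ys if p_xy[(x, y)] > 0)

# The definition: average the per-outcome entropies, weighted by p_X(x)
H_Y_given_X = sum(p_x[x] * H_Y_given_x(x) for x in xs)

# Consistency check via the chain rule: H(X, Y) = H(X) + H(Y | X)
H_XY = -sum(p * math.log2(p) for p in p_xy.values())
H_X = -sum(p * math.log2(p) for p in p_x.values())
print(H_Y_given_X, H_XY - H_X)  # the two values coincide
```

That the weighted average satisfies the chain rule, while the unweighted sum from the question does not, is one concrete justification for weighting by $p_X(x)$ in the definition.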