I am studying basic language models before moving on to CRFs, using http://www.eng.utah.edu/~cs6961/papers/klinger-crf-intro.pdf.
But I am stuck on page 6.
It gives the conditional entropy as
$H(y|x)=-\sum_{(x,y) \in Z} p(y,x) \log p(y|x)$
Shouldn't it be
$H(y|x)=-\sum_{(x,y) \in Z} p(y|x) \log p(y|x)$
Why is it $p(y,x)$ instead of $p(y|x)$? (https://en.wikipedia.org/wiki/Entropy_(information_theory))
Can the two be used interchangeably?
Thank you.
The correct definition is the one appearing in the text. You may understand the conditional entropy as follows.

Let $Y$ and $X$ be random variables with a joint distribution $p_{X,Y}(x,y)$. Given $X=x_0$, $Y$ is described by the conditional distribution $p_{Y|X}(y|x_0)$ and has entropy $$ H(Y|X=x_0) = -\sum_y p_{Y|X}(y|x_0) \log p_{Y|X}(y|x_0). $$

Note that the notation $H(Y|X=x_0)$ is unconventional (although it appears every now and then). Here it only serves as a reminder that we are computing the entropy of $Y$ using the standard definition; however, since we know that $X=x_0$, we use the conditional distribution of $Y$ given $X=x_0$ (instead of the marginal distribution $p_Y(y)$).
Now the conditional entropy of $Y$ given $X$ (not $Y$ given $X=x_0$; note the difference) is the average of $H(Y|X=x_0)$ over all possible values $x_0$, weighted by $p_X(x_0)$, that is
$$ \begin{align} H(Y|X) &= \sum_{x_0}p_X(x_0) H(Y|X=x_0) \\ &=-\sum_{x_0} \sum_y p_X(x_0)p_{Y|X}(y|x_0) \log p_{Y|X}(y|x_0) \\ &=-\sum_{x_0} \sum_y p_{Y,X}(y,x_0) \log p_{Y|X}(y|x_0), \end{align} $$ which is the same formula as the one in your text.
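To see concretely that the two formulas are not interchangeable, here is a quick numerical check on a made-up $2\times 2$ joint distribution (the probabilities are arbitrary, chosen only for illustration):

```python
import math

# Toy joint distribution p(x, y) over X in {0, 1}, Y in {0, 1};
# the four probabilities sum to 1.
p_joint = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.2, (1, 1): 0.3,
}

# Marginal p(x) and conditional p(y|x) derived from the joint.
p_x = {x: sum(p for (xx, _), p in p_joint.items() if xx == x) for x in (0, 1)}
p_y_given_x = {(x, y): p_joint[(x, y)] / p_x[x] for (x, y) in p_joint}

# Correct conditional entropy: weight log p(y|x) by the JOINT p(y, x).
H_correct = -sum(p_joint[(x, y)] * math.log(p_y_given_x[(x, y)])
                 for (x, y) in p_joint)

# The formula proposed in the question: weight by p(y|x) instead.
H_wrong = -sum(p_y_given_x[(x, y)] * math.log(p_y_given_x[(x, y)])
               for (x, y) in p_joint)

# Averaging the per-x0 entropies H(Y|X=x0) with weights p(x0)
# reproduces the correct value, as in the derivation above.
H_avg = sum(p_x[x] * -sum(p_y_given_x[(x, y)] * math.log(p_y_given_x[(x, y)])
                          for y in (0, 1))
            for x in (0, 1))

print(H_correct, H_avg, H_wrong)
```

`H_correct` and `H_avg` agree exactly, while `H_wrong` differs. Intuitively, the weights $p(y|x)$ sum to the number of $x$ values rather than to $1$, so the question's formula is not a valid expectation over $(x,y)$ pairs at all.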