If we have the Markov chain
\begin{align*} X \rightarrow Y \rightarrow \hat X \end{align*}
where $X$ is a random variable, $Y$ is a random variable correlated with $X$, and $\hat X$ is our estimate of $X$.
Using Fano's inequality, we can bound the error probability $P_e = \mathbb{P}[\hat X \neq X ]$ in terms of the conditional entropy $H(X\mid Y)$. The result will be:
\begin{align*} P_e \ge \frac{H(X\mid Y) - 1}{\log |\mathcal X|} \end{align*}
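Before going into the proof, it may help to see the bound on a concrete example. The sketch below is my own illustration, not from the textbook: the alphabet, the channel, and the naive estimator $\hat X = Y$ are all made up, and the code just checks the inequality numerically.

```python
from math import log2

ALPHABET = [0, 1, 2]
p_correct = 0.8  # assumed channel: Y = X w.p. 0.8, otherwise a uniform error

# joint distribution p(x, y) with X uniform on the alphabet
p_xy = {}
for x in ALPHABET:
    for y in ALPHABET:
        p_y_given_x = p_correct if y == x else (1 - p_correct) / 2
        p_xy[(x, y)] = (1 / len(ALPHABET)) * p_y_given_x

# conditional entropy H(X|Y) = -sum_{x,y} p(x,y) log2 p(x|y)
p_y = {y: sum(p_xy[(x, y)] for x in ALPHABET) for y in ALPHABET}
H_X_given_Y = -sum(pr * log2(pr / p_y[y]) for (x, y), pr in p_xy.items() if pr > 0)

# error probability of the naive estimator Xhat = Y
P_e = sum(pr for (x, y), pr in p_xy.items() if y != x)

# Fano: P_e >= (H(X|Y) - 1) / log|X|
fano_lower_bound = (H_X_given_Y - 1) / log2(len(ALPHABET))
assert P_e >= fano_lower_bound
```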
However, part of the proof involves showing that
\begin{align*} H(E,X \mid \hat X) &= H(X\mid \hat X) + \underbrace{H(E \mid X, \hat X)}_{= 0} \\ &= H(E\mid \hat X) + H(X \mid E, \hat X) \end{align*}
where $E$ is a random variable indicating whether $X$ equals $\hat X$:
\begin{align*} E = \begin{cases}1 & \text{if } X \neq \hat X \\ 0 & \text{if } X = \hat X \end{cases} \end{align*}
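To convince myself of this decomposition, the two chain-rule expansions can be checked numerically. The sketch below is a toy example of my own (the channel and its parameters are assumptions, not from the textbook), with a small helper for conditional entropies over a joint distribution:

```python
from math import log2

ALPHABET = [0, 1, 2]
p_correct = 0.8  # assumed channel: Xhat = X w.p. 0.8, otherwise a uniform error

# joint distribution over tuples (x, xhat, e); e is determined by x and xhat
joint = {}
for x in ALPHABET:
    for xhat in ALPHABET:
        pr = (1 / len(ALPHABET)) * (p_correct if xhat == x else (1 - p_correct) / 2)
        joint[(x, xhat, int(x != xhat))] = pr

def cond_entropy(j, A, B):
    """H(coords A | coords B) in bits, for a joint dict {tuple: prob}."""
    pab, pb = {}, {}
    for key, pr in j.items():
        a = tuple(key[i] for i in A)
        b = tuple(key[i] for i in B)
        pab[(a, b)] = pab.get((a, b), 0.0) + pr
        pb[b] = pb.get(b, 0.0) + pr
    return -sum(pr * log2(pr / pb[b]) for (a, b), pr in pab.items() if pr > 0)

# coordinate indices: 0 = X, 1 = Xhat, 2 = E
H_E_given_XXhat = cond_entropy(joint, [2], [0, 1])  # should be 0: E is a function of (X, Xhat)
expansion1 = cond_entropy(joint, [0], [1]) + H_E_given_XXhat      # H(X|Xh) + H(E|X,Xh)
expansion2 = cond_entropy(joint, [2], [1]) + cond_entropy(joint, [0], [2, 1])  # H(E|Xh) + H(X|E,Xh)

assert abs(H_E_given_XXhat) < 1e-12
assert abs(expansion1 - expansion2) < 1e-12
```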
Comparing the two expansions of $H(E,X \mid \hat X)$, we already see that:
\begin{align*} H(X\mid \hat X) = H(E\mid \hat X) + H(X \mid E, \hat X) \end{align*}
And since $H(E\mid \hat X) \leq H(E) = H(P_e)$, we immediately see that
\begin{align*} H(X\mid \hat X) \leq H(P_e) + H(X \mid E, \hat X) \end{align*}
Here comes the part I do not quite understand, concerning $H(X \mid E, \hat X)$. The textbook states that
\begin{align*} H(X\mid E, \hat X) &= \mathbb{P}[E=0]H(X \mid \hat X, E = 0) + \mathbb{P}[E=1]H(X\mid \hat X, E = 1) \\ &\leq (1 - P_e) \cdot 0 + P_e \log |\mathcal X | \end{align*}
Hence
\begin{align*} H(P_e) + P_e \log |\mathcal X | \geq H(X\mid \hat X) \geq H(X \mid Y) \end{align*}
where the last inequality follows from the data-processing inequality.
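The whole chain can also be verified numerically. The following sketch is again a made-up example of my own: $X$ is uniform on $\{0,1,2\}$, $Y$ is a noisy observation, and I deliberately pick a lossy estimator $\hat X = f(Y)$ (merging symbol $2$ into $0$) so that the data-processing step is non-trivial.

```python
from math import log2

ALPHABET = [0, 1, 2]
p_correct = 0.8  # assumed channel: Y = X w.p. 0.8, otherwise a uniform error

def f(y):
    """Deliberately lossy estimator: merges symbol 2 into 0."""
    return 0 if y == 2 else y

def h_binary(q):
    """Binary entropy H(q) in bits."""
    return 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)

def cond_entropy(j, A, B):
    """H(coords A | coords B) in bits, for a joint dict {tuple: prob}."""
    pab, pb = {}, {}
    for key, pr in j.items():
        a = tuple(key[i] for i in A)
        b = tuple(key[i] for i in B)
        pab[(a, b)] = pab.get((a, b), 0.0) + pr
        pb[b] = pb.get(b, 0.0) + pr
    return -sum(pr * log2(pr / pb[b]) for (a, b), pr in pab.items() if pr > 0)

# joint distribution over (x, y, xhat) with xhat = f(y)
joint = {}
for x in ALPHABET:
    for y in ALPHABET:
        pr = (1 / len(ALPHABET)) * (p_correct if y == x else (1 - p_correct) / 2)
        key = (x, y, f(y))
        joint[key] = joint.get(key, 0.0) + pr

# coordinate indices: 0 = X, 1 = Y, 2 = Xhat
H_X_given_Y = cond_entropy(joint, [0], [1])
H_X_given_Xhat = cond_entropy(joint, [0], [2])
P_e = sum(pr for (x, y, xhat), pr in joint.items() if xhat != x)

# data-processing step: H(X | Xhat) >= H(X | Y) since Xhat is a function of Y
assert H_X_given_Xhat >= H_X_given_Y - 1e-12
# Fano-style upper bound: H(P_e) + P_e log|X| >= H(X | Xhat)
assert h_binary(P_e) + P_e * log2(len(ALPHABET)) >= H_X_given_Xhat - 1e-12
```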
What I am not getting is why $H(X\mid E, \hat X)$ can be written as
\begin{align*} H(X\mid E, \hat X) &= \mathbb{P}[E=0]H(X \mid \hat X, E = 0) + \mathbb{P}[E=1]H(X\mid \hat X, E = 1) \end{align*}
and also why this is bounded by $P_e \log |\mathcal X|$. The equation tells me that $H(X \mid \hat X, E = 0) = 0$, and my textbook argues that
"Since given $E = 0$, we have $X = \hat X$."
but I still don't see why this equals $0$.
Further, I've stated $H(E \mid X, \hat X) = 0$, and this is justified because $E$ is a function of $X$ and $\hat X$. But why is $H(X \mid E, \hat X) \neq 0$?
Intuitively, I'd say $H(E \mid X, \hat X) = 0$ because once I know $X$ and $\hat X$ I also know $E$, so there is no uncertainty left. However, the same reasoning does not apply to $H(X \mid E, \hat X)$; how can I argue this?


It's a property (sometimes taken as the definition) of conditional entropy: the conditional entropy equals the expected value of the entropy conditioned on the particular values of the conditioning variable:
$$H( X | Y) = \sum_y P(Y=y) H(X | Y=y)$$
The formula above just applies this to the random variable $E$ (which takes two values, $0$ and $1$).
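For concreteness, here is a quick numerical check of this identity on a small made-up joint distribution (the numbers are arbitrary, just for illustration):

```python
from math import log2

# toy joint distribution p(x, y); the values are arbitrary
p_xy = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}
p_y = {}
for (x, y), pr in p_xy.items():
    p_y[y] = p_y.get(y, 0.0) + pr

# direct definition: H(X|Y) = -sum_{x,y} p(x,y) log2 p(x|y)
H_direct = -sum(pr * log2(pr / p_y[y]) for (x, y), pr in p_xy.items())

# expectation form: H(X|Y) = sum_y P(Y=y) H(X | Y=y)
H_expect = 0.0
for y0, py in p_y.items():
    H_y = -sum(
        (pr / py) * log2(pr / py)
        for (x, y), pr in p_xy.items() if y == y0
    )
    H_expect += py * H_y

assert abs(H_direct - H_expect) < 1e-12
```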
$$H(X \mid \hat X, E = 0) = 0$$
This can be seen as follows: $E=0$ means that there was no error in the estimation; hence, knowing this and knowing the estimated value $\hat X$, we know the value of $X$ exactly ($X = \hat X$), so there is no uncertainty left about $X$ and the entropy is zero.
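A minimal numerical illustration of that last point (my own, with a hypothetical three-symbol alphabet $\{a, b, c\}$): conditioned on $\hat X = a$ and $E = 0$, the distribution of $X$ is a point mass at $a$, and the entropy of a point mass is $0$.

```python
from math import log2

# conditional distribution of X given Xhat = a and E = 0: a point mass at a
p_x = {"a": 1.0, "b": 0.0, "c": 0.0}
# entropy, with the usual convention 0 * log 0 = 0
H = -sum(p * log2(p) for p in p_x.values() if p > 0)
assert H == 0.0
```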
I'm not sure if this answers all your doubts.