I basically understand Fano's inequality and what its conclusion is. However, I don't understand the beginning of the proof. The textbook I am using does not explain why we prove Fano's inequality by starting from $H(E,X|\hat{X})$. My question is: why is it not $H(E|X,\hat{X})$? I know $H(E|X,\hat{X})$ is zero, but if I had to put it in a sentence I'd say
"$H(E|X,\hat{X})$ is the entropy of the error random variable $E$ given the actual value $X$ and its estimate $\hat{X}$."
Yet, the proof goes
\begin{align*} H(E,X|\hat{X}) &= H(X|\hat{X}) + H(E|X,\hat{X}) \\ &= H(E|\hat{X}) + H(X|E,\hat{X}) \\ \Rightarrow 1 + P_e \log_2 |\mathcal{X}| &\geq H(X|\hat{X}) \geq H(X|Y) \end{align*}
Based on the Markov triple
\begin{align*} X \rightarrow Y \rightarrow \hat{X} \end{align*}
where we replace $Y$ by $E$.
Can somebody explain to me why it starts off with $H(E,X|\hat{X})$? How do I know I have to "start" like this?
That's a typical trick that we use when knowing something (say, $A$) eliminates all the uncertainty about something else (say, $B$), hence $$H(B\mid A)=0 \tag{1}$$
We write the chain rule: $H(A,B)=H(A \mid B) + H(B) = H(B \mid A) + H(A)$. We write this not because we are really interested in the joint $H(A,B)$, but because it allows us to equate the two expressions on the right, so that one term vanishes and we get $$H(A)=H(B) + H(A \mid B) \tag{2}$$
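As a quick sanity check, the two chain-rule decompositions can be verified numerically. The joint distribution below is a toy example I made up purely for illustration; the conditional entropies are computed directly from their definitions, not via the chain rule itself:

```python
import math

def H(p):
    """Shannon entropy in bits of a probability dict {outcome: prob}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

# A small, arbitrary joint distribution p(a, b), chosen only for illustration.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
pA = {0: 0.5, 1: 0.5}   # marginal of A
pB = {0: 0.6, 1: 0.4}   # marginal of B

# H(A|B) = sum_b p(b) H(A | B=b), computed from the definition.
H_A_given_B = sum(
    qb * H({a: joint[(a, b)] / qb for a in pA}) for b, qb in pB.items()
)
H_B_given_A = sum(
    qa * H({b: joint[(a, b)] / qa for b in pB}) for a, qa in pA.items()
)

# Both chain-rule decompositions recover the same joint entropy:
# H(A,B) = H(A|B) + H(B) = H(B|A) + H(A).
assert abs(H(joint) - (H_A_given_B + H(pB))) < 1e-12
assert abs(H(joint) - (H_B_given_A + H(pA))) < 1e-12
```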
In the Fano proof: in principle, $E$ has some uncertainty, and so has $E \mid \hat X$. But if we add to that the knowledge of the real value $X$, then all uncertainty vanishes: hence we use the above, replacing $A$ by $X$ and $B$ by $E$, all preconditioned on $\hat X$.
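Spelling that substitution out (this just restates the identity from the question, with the vanishing term marked):
\begin{align*}
H(X,E\mid\hat{X}) &= H(X\mid\hat{X}) + \underbrace{H(E\mid X,\hat{X})}_{=\,0} \\
&= H(E\mid\hat{X}) + H(X\mid E,\hat{X}),
\end{align*}
so $H(X\mid\hat{X}) = H(E\mid\hat{X}) + H(X\mid E,\hat{X})$. From there, the two standard bounds $H(E\mid\hat{X}) \leq H(E) \leq 1$ (since $E$ is binary) and $H(X\mid E,\hat{X}) \leq P_e \log_2 |\mathcal{X}|$ (given $E=0$ we have $X=\hat{X}$, so that term contributes nothing; given $E=1$, $X$ takes at most $|\mathcal{X}|$ values) yield the inequality in the question.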
That's why "we start" from $H(X,E\mid\hat{X})$.
(Of course, there were other alternatives: we could have chosen the pair $(\hat{X},X)$ as the variable $A$, or even $\hat{X}$ with everything conditioned on $X$. But that would not have been useful. The point is that, in our scenario, $\hat{X}$ is always known, so it makes sense to include it in all the equations as a global conditioning variable.)
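To see the resulting bound in action, here is a minimal numerical sketch with a made-up estimator (nothing here comes from the proof; the channel, alphabet size, and accuracy are arbitrary choices): $X$ is uniform on three symbols and $\hat X$ is correct with probability $0.7$, so $1 + P_e \log_2|\mathcal{X}|$ should dominate $H(X\mid\hat X)$.

```python
import math

def H(p):
    """Shannon entropy in bits of a probability dict {outcome: prob}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

# Toy setting, chosen only for illustration: X uniform on {0, 1, 2},
# and an estimator X_hat that is correct with probability 0.7 and
# otherwise outputs one of the two wrong symbols uniformly.
symbols = [0, 1, 2]
joint = {
    (x, xh): (1 / 3) * (0.7 if x == xh else 0.15)
    for x in symbols for xh in symbols
}
p_xh = {xh: sum(joint[(x, xh)] for x in symbols) for xh in symbols}

# H(X | X_hat) computed from the definition.
H_X_given_Xh = sum(
    q * H({x: joint[(x, xh)] / q for x in symbols}) for xh, q in p_xh.items()
)

# Error probability P_e = P(X != X_hat).
P_e = sum(q for (x, xh), q in joint.items() if x != xh)

# Weak form of Fano's inequality: 1 + P_e * log2(|X|) >= H(X | X_hat).
assert 1 + P_e * math.log2(len(symbols)) >= H_X_given_Xh
```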