Interpreting Fano's Inequality


Fano's inequality says that if I estimate a discrete $\mathcal X$-valued random variable $X$ by observing the discrete $\mathcal Y$-valued random variable $Y$ and applying a (possibly stochastic) estimator $g :\mathcal Y \to \mathcal{X}$ via $\hat X = g(Y)$, then $$ P(\hat X \neq X) \geq \frac{H(X | Y) - 1}{\log_2 |\mathcal X|}, \tag{1} $$ where $H(X|Y)$ is the conditional entropy of $X$ given $Y$.
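For concreteness, here is a small numeric sketch of the bound in (1) on a made-up joint pmf (the distribution and helper function are my own, not from any library):

```python
import math

def cond_entropy(p_xy):
    """H(X|Y) in bits for a joint pmf given as {(x, y): p}."""
    p_y = {}
    for (x, y), p in p_xy.items():
        p_y[y] = p_y.get(y, 0.0) + p
    return -sum(p * math.log2(p / p_y[y])
                for (x, y), p in p_xy.items() if p > 0)

# Toy noisy channel on {0, 1, 2}: X = Y with prob 1/2, else uniform on the rest.
p_xy = {(x, y): (0.5 if x == y else 0.25) / 3
        for x in range(3) for y in range(3)}
n_x = 3

h = cond_entropy(p_xy)                    # H(X|Y) = 1.5 bits here
fano_lower_bound = (h - 1) / math.log2(n_x)  # ≈ 0.315
print(round(h, 4), round(fano_lower_bound, 4))
```

Any estimator of $X$ from $Y$ in this example must therefore err with probability at least about $0.315$; the MAP estimator $\hat X = Y$ errs with probability $0.5$, consistent with the bound.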

Though adding zero-probability values to $\mathcal X$ does not change entropy, it does change the denominator $\log_2|\mathcal X|$ and hence Fano's bound. This bothers me. The largest bound is achieved by eliminating zero-probability elements of $\mathcal X$. Can this be explained alternatively by somehow incorporating in $Y$ knowledge of zero-probability elements of $\mathcal X$?

Let me elaborate on what's bothering me. If $Y$ somehow tells me that the elements of some $\mathcal Z \subset \mathcal X$ have zero probability, then I could "start over" when formulating Fano's inequality: declare a random variable $\tilde X$ taking values in $\tilde{\mathcal{X}} = \mathcal X \setminus \mathcal Z$ and obtain $$ P(\hat{\tilde{X}} \neq \tilde X) \geq \frac{H(\tilde X | Y) - 1}{\log_2|\tilde{\mathcal{X}}|}. \tag{2} $$ As far as I can tell, $H(\tilde X | Y) = H(X | Y)$, so (2) gives a larger lower bound on the error probability than (1).
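The effect is easy to see numerically. In this sketch (my own toy distribution, not from the question), $H(X|Y)$ is held fixed while the alphabet is padded with dummy symbols:

```python
import math

def cond_entropy(p_xy):
    """H(X|Y) in bits for a joint pmf given as {(x, y): p}."""
    p_y = {}
    for (x, y), p in p_xy.items():
        p_y[y] = p_y.get(y, 0.0) + p
    return -sum(p * math.log2(p / p_y[y])
                for (x, y), p in p_xy.items() if p > 0)

# Noisy channel on {0, 1, 2}: X = Y with prob 1/2, else uniform on the rest.
p_xy = {(x, y): (0.5 if x == y else 0.25) / 3
        for x in range(3) for y in range(3)}
h = cond_entropy(p_xy)  # 1.5 bits; unaffected by zero-probability symbols

# Padding X's alphabet with dummy symbols shrinks Fano's lower bound (1),
# even though neither H(X|Y) nor the actual error probability changes.
for n_x in (3, 4, 8):
    print(n_x, (h - 1) / math.log2(n_x))
```

The bound degrades from roughly $0.315$ at $|\mathcal X| = 3$ to $0.25$ at $|\mathcal X| = 4$ and $1/6$ at $|\mathcal X| = 8$, which is exactly the weakening the question describes.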

My intuition is that one shouldn't have to "start over", but should instead be able to explain this directly in terms of $H(X | Y)$.


In Fano's inequality, the denominator is formally $\log_2(|\mathrm{supp}(X)| - 1),$ where $\mathrm{supp}(X)$ is the support of $X$, i.e. $\{x \in \mathcal{X}: P_X(x) > 0\}$. This automatically handles the case where dummy labels with no mass are chucked into $\mathcal{X}$. In fact, even more is true if you're willing to make the bound depend on the estimation process.
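A quick sketch of why the support-based denominator resolves the issue; here `h_xy` is a hypothetical value of $H(X \mid Y)$ and the pmf is made up for illustration:

```python
import math

def support_size(p_x):
    """Number of symbols with strictly positive mass."""
    return sum(1 for p in p_x.values() if p > 0)

# pmf over 5 labels, but only 3 of them carry mass.
p_x = {"a": 0.5, "b": 0.25, "c": 0.25, "dummy1": 0.0, "dummy2": 0.0}
h_xy = 1.2  # hypothetical H(X|Y) in bits (must be <= log2(3) here)

naive = (h_xy - 1) / math.log2(len(p_x))                 # uses |X| = 5
sharp = (h_xy - 1) / math.log2(support_size(p_x) - 1)    # uses |supp(X)| - 1 = 2
print(naive, sharp)
```

The naive bound is roughly $0.086$, while the support-based bound is $0.2$; moreover the latter is unchanged no matter how many zero-mass dummies are appended, since they never enter $\mathrm{supp}(X)$.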

If $E$ is the indicator random variable that equals $1$ when there is an error, and $\hat{X}$ is the estimate, the denominator in Fano's inequality can be taken to be $H(X \mid E = 1, \hat{X})$ (this falls out of the standard proof of Fano's inequality). This quantity is uniformly bounded above by $\log_2(|\mathrm{supp}(X)| - 1)$, and the usual bound is stated in that generic form to strip away details of the estimation process: we typically apply the inequality to every possible estimator at once in order to get universal lower bounds on the probability of error. However, if the value of $\hat{X}$ in fact leaks further information, such as restricting the support of $X$, and you're only interested in analysing this particular estimator, then the bound indeed increases.
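For reference, the chain in the standard proof that produces this denominator runs roughly as follows, using that $E$ is determined by $(X, \hat X)$ and that $H(X \mid E = 0, \hat X) = 0$:

$$ \begin{aligned} H(X \mid \hat X) &= H(E, X \mid \hat X) = H(E \mid \hat X) + H(X \mid E, \hat X) \\ &\leq H(E) + P(E = 1)\, H(X \mid E = 1, \hat X) \leq 1 + P(\hat X \neq X)\, \log_2\bigl(|\mathrm{supp}(X)| - 1\bigr). \end{aligned} $$

Combining this with $H(X \mid Y) \leq H(X \mid \hat X)$ (data processing, since $X \to Y \to \hat X$ is a Markov chain) and rearranging gives both the estimator-dependent bound with denominator $H(X \mid E = 1, \hat X)$ and the generic one with denominator $\log_2(|\mathrm{supp}(X)| - 1)$.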