In most introductory textbooks on information theory, the entropy of a discrete random variable (r.v.) is defined as
$$H(X) \triangleq -\sum p(x)\log p(x)=-\mathbb E[\log p(X)],$$
where $p$ is the pmf of $X$; while that of a continuous random variable is given by
$$h(X)\triangleq -\int f(x)\log f(x) dx=-\mathbb E[\log f(X)],$$
where $f$ is the pdf of $X$.
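For concreteness, both formulas are easy to check numerically. Here is a minimal sketch (the three-point pmf and the standard Gaussian are just illustrative choices; natural logs, so entropies are in nats):

```python
import numpy as np

# Discrete entropy H(X) = -sum p(x) log p(x), in nats, for a three-point pmf.
p = np.array([0.5, 0.25, 0.25])
H = -np.sum(p * np.log(p))

# Differential entropy h(X) = -∫ f log f dx for a standard Gaussian,
# approximated by a Riemann sum; the closed form is (1/2) log(2*pi*e).
x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
h = -np.sum(f * np.log(f)) * dx

print(H)                                   # ≈ 1.0397 nats
print(h, 0.5 * np.log(2 * np.pi * np.e))   # both ≈ 1.4189
```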
Question: Is there a unified definition of entropy for an arbitrary random variable?
My question is motivated by Robert M. Gray's book, "Entropy and Information Theory." In that book, he provides a unified definition of divergence and hence of mutual information:
Given a probability space $(\Omega, \mathcal B, P)$ and another probability measure $M$ defined on the same space, define the divergence of $P$ with respect to $M$ by
$$D(P\, \Vert\, M)\triangleq \sup_{\mathcal Q} \sum_{Q\in \mathcal Q}P(Q)\log\frac{P(Q)}{M(Q)},$$
where the supremum is over all finite measurable partitions $\mathcal Q$ of $\Omega.$
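The supremum-over-partitions definition can be probed numerically. A sketch, with $P = N(0,1)$ and $M = N(1,1)$ as my own illustrative choices (the exact divergence between these two is $(0-1)^2/2 = 0.5$ nats), showing that nested refinements of the partition increase the sum toward the true value:

```python
import numpy as np
from scipy.stats import norm

# D(P||M) = sup over finite partitions Q of sum_Q P(Q) log(P(Q)/M(Q)).
# For P = N(0,1), M = N(1,1) the exact value is 0.5 nats.
P, M = norm(0, 1), norm(1, 1)

def partition_sum(n_bins):
    # Uniform bins on [-8, 9]; the two tail cells carry negligible mass.
    edges = np.linspace(-8, 9, n_bins + 1)
    cp = np.diff(P.cdf(edges))
    cm = np.diff(M.cdf(edges))
    mask = (cp > 0) & (cm > 0)
    return float(np.sum(cp[mask] * np.log(cp[mask] / cm[mask])))

# Each partition refines the previous one, so the sums are nondecreasing.
vals = [partition_sum(n) for n in (2, 8, 32, 512)]
print(vals)   # increasing, approaching 0.5
```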
For any two random variables, define $$I(X;Y)\triangleq D(P_{XY}\Vert P_X\times P_Y),$$ where $P_{XY}$ and $P_X\times P_Y$ are the joint distribution and product distribution of $X$ and $Y$, respectively.
From my understanding, the nice thing about this definition of mutual information is that it works for arbitrary random variables, and it reduces to the usual definition of $I(X;Y)$ when $X$ and $Y$ are both discrete or both continuous r.v.'s.
In Gray's book, he then goes on to define the entropy in terms of mutual information defined above:
$$H(X)\triangleq I(X;X).$$
For a discrete r.v., this also reduces to the regular definition of entropy given at the very beginning of this question. Perfect. However, if $X$ is a continuous r.v., say Gaussian, then I think this definition gives $H(X)=\infty,$ since it implies that
$$H(X)=\sup_q H(q(X)),$$ where the supremum is over all finite quantizers $q$ of $X$. So it appears inconsistent with $h(X)$, the usual finite (differential) entropy, doesn't it? Hence the question.
How do we reconcile this inconsistency? Does it make more sense to define the entropy of a continuous r.v. as infinity, avoid using it, and work only with its mutual information with other random variables? And if we need an entropy-like quantity, do we just fall back on the differential entropy $h(X)$, accepting that it is defined differently from the entropy $H(X)$?
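The divergence of $H(q(X))$ under refinement is easy to see numerically. A sketch for $X \sim N(0,1)$ (my illustrative choice), using uniform quantizers of bin width $\Delta$: the quantized entropy behaves like $h(X) + \log(1/\Delta)$, which grows without bound as $\Delta \to 0$.

```python
import numpy as np
from scipy.stats import norm

# Entropy (in nats) of a uniformly quantized N(0,1). As the bin width Delta
# shrinks, H(q(X)) ≈ h(X) + log(1/Delta) → ∞, matching H(X) = sup_q H(q(X)) = ∞.
h_gauss = 0.5 * np.log(2 * np.pi * np.e)   # differential entropy of N(0,1)

def quantized_entropy(delta):
    edges = np.arange(-12, 12 + delta, delta)
    p = np.diff(norm.cdf(edges))
    p = p[p > 0]                            # drop empty bins
    return float(-np.sum(p * np.log(p)))

for delta in (1.0, 0.1, 0.01):
    print(delta, quantized_entropy(delta), h_gauss + np.log(1 / delta))
```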
First of all, differential entropy almost can't be the right definition, and in my opinion it's mostly a historical accident. There are a number of reasons, but the clearest one is that it is coordinate dependent. Consider the random variable $X$ on $[0,1]$ whose cumulative distribution function is $F(x) = x^2$ and whose density function is $F'(x) = 2x$. Its differential entropy is:
$$H(X) = -\int_0^1 2x \log(2x)\, dx = \frac{1}{2} - \log 2$$
Now apply the coordinate change $u = x^2$. The new cumulative distribution function is $G(u) = u$, so the new density function is $G'(u) = 1$ and the differential entropy is:
$$H(X) = -\int_0^1 1 \cdot \log 1\, du = 0$$
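The first of these two values can be confirmed with a quick numerical integration (a sketch; the grid resolution is arbitrary):

```python
import numpy as np

# Differential entropy of X with density f(x) = 2x on [0,1] (natural log):
# closed form is 1/2 - log 2 ≈ -0.1931, i.e. negative. After the change of
# variables u = x^2 the density is identically 1, so the integral is exactly 0.
x = np.linspace(1e-9, 1, 200001)
dx = x[1] - x[0]
f = 2 * x
h_x = -np.sum(f * np.log(f)) * dx

print(h_x, 0.5 - np.log(2))   # both ≈ -0.1931
```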
I've never heard a convincing explanation for why an information theoretic quantity should depend on the coordinate system. (Side note: in the first coordinate system $H(X) < 0$, which is also a bit pathological.)
Edwin Jaynes was the first to notice this problem, and he also figured out the solution (though I'm not sure he would have expressed it the same as I will). If you think carefully about how entropy arises in practice, it is almost always as a relative quantity: the quantity of interest in applications is usually the information gained relative to some prior. In the discrete case the prior is usually just the uniform distribution so you don't lose much by pretending that entropy is an absolute quantity, but in the continuous case you often don't have a uniform distribution (e.g. there is no uniform probability measure on $\mathbb{R}$) and even when you do it looks different in different coordinate systems.
So the solution is to work relative to an explicit prior by considering the KL-divergence to be the fundamental object. You gave one definition, but a better definition uses Radon-Nikodym derivatives. Assume that $P$ is a probability measure which is absolutely continuous with respect to $Q$ (if it is not then the relative entropy has to be infinite) and define:
$$D(P \| Q) = \mathbb{E}_P \left( \log \frac{dP}{dQ} \right) = \int \log \frac{dP}{dQ}\, dP$$
where $\frac{dP}{dQ}$ is the Radon-Nikodym derivative.
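When both measures have densities, $\frac{dP}{dQ}$ is just the density ratio, so $\mathbb{E}_P[\log \frac{dP}{dQ}]$ can be estimated by Monte Carlo from samples of $P$. A sketch with two Gaussians of my choosing, checked against the closed-form Gaussian KL divergence:

```python
import numpy as np

# D(P||Q) = E_P[log dP/dQ] for P = N(0,1), Q = N(mu, sigma^2); here dP/dQ is
# the density ratio. Closed form: log sigma + (1 + mu^2)/(2 sigma^2) - 1/2.
rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0

def log_p(x):  # log density of N(0,1)
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def log_q(x):  # log density of N(mu, sigma^2)
    return -0.5 * ((x - mu) / sigma)**2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

x = rng.standard_normal(1_000_000)      # samples from P
d_mc = np.mean(log_p(x) - log_q(x))     # Monte Carlo estimate of E_P[log dP/dQ]
d_exact = np.log(sigma) + (1 + mu**2) / (2 * sigma**2) - 0.5
print(d_mc, d_exact)                    # both ≈ 0.4431 nats
```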
Now, given a real-valued random variable $X$ on a probability space $(\Omega, \Sigma, P)$, push $P$ forward along $X$ to obtain a measure $P_X$ on $\mathbb{R}$. Formulate a prior as a probability distribution $Q$ on $\mathbb{R}$ (normalized Lebesgue measure on a large interval is a common choice) and consider the quantity $D(P_X \| Q)$. This is almost always a better object to work with in applications than the differential entropy of $X$.
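As a sketch of why this is better behaved: if $Q$ is Lebesgue measure on $[a,b]$ (normalized, with $b-a=1$ here for simplicity) and $P_X$ is supported inside $[a,b]$ with density $f$, then $D(P_X \| Q) = \int f \log f$, which for the $f(x)=2x$ example above is $\log 2 - \tfrac12$: nonnegative and coordinate-free where the differential entropy was negative.

```python
import numpy as np

# D(P_X || Q) for P_X with density f(x) = 2x on [0,1] and Q = Lebesgue
# measure on [0,1] (so dP_X/dQ = f): D = ∫ f log f = log 2 - 1/2 ≈ 0.1931.
x = np.linspace(1e-9, 1, 200001)
dx = x[1] - x[0]
f = 2 * x
D = np.sum(f * np.log(f)) * dx

print(D, np.log(2) - 0.5)   # both ≈ 0.1931, and nonnegative as a divergence must be
```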