I am learning about the cross entropy, defined by Wikipedia as $$H(P,Q)=-\text{E}_P[\log Q]$$ for distributions $P,Q$.
I'm not happy with that notation, for several reasons: it suggests symmetry in $P$ and $Q$, the notation $H(X,Y)$ is commonly used for the joint entropy, and lastly, I want a notation which is consistent with the notation for entropy: $$H(X)=-\text{E}_P[\log P(X)]$$
When dealing with multiple distributions, I like to write $H_P(X)$ so it's clear with respect to which distribution I'm taking the entropy. When dealing with multiple random variables, I think it's sensible to make explicit the random variable with respect to which the expectation is taken by using the subscript $_{X\sim P}$. My notation for entropy thus becomes $$H_{X\sim P}(X)=-\text{E}_{X\sim P}[\log P(X)]$$
Now comes the point I don't understand about the definition of cross entropy: why doesn't it reference a random variable $X$? Applying the same reasoning as above, I would expect cross entropy to have the form \begin{equation}H_{X\sim P}(Q(X))=-\text{E}_{X\sim P}[\log Q(X)]\tag{1}\end{equation} However, the Wikipedia article on cross entropy makes no mention of any such random variable $X$. It speaks of
the cross-entropy between two probability distributions $p$ and $q$
which, like the notation $H(P,Q)$, implies a function whose argument is a pair of distributions, whereas entropy $H(X)$ is said to be a function of a random variable. In any case, to take an expected value I need a (function of a) random variable, which $P$ and $Q$ are not.
Comparing the definitions for the discrete case: $$H(p,q)=-\sum_{x\in\mathcal{X}}p(x)\log q(x)$$ and $$H(X)=-\sum_{i=1}^n P(x_i)\log P(x_i)$$
where $\mathcal{X}$ is the support of $P$ and $Q$, there would only be a qualitative difference if the events $x_i$ didn't cover the whole support (though I could just choose an $X$ which does).
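To make the discrete definitions (and the asymmetry of cross entropy) concrete, here is a small numerical sketch; the pmfs `p` and `q` are made-up examples, not anything from the article:

```python
import numpy as np

# Two made-up pmfs on the same support {0, 1, 2}
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

def entropy(p):
    """H(P) = -sum_x P(x) log P(x)."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(P, Q) = -sum_x p(x) log q(x): expectation of -log q under p."""
    return -np.sum(p * np.log(q))

print(entropy(p))
print(cross_entropy(p, q))  # in general != cross_entropy(q, p)
print(cross_entropy(q, p))
```

Note that `cross_entropy(p, p)` recovers `entropy(p)`, and by Gibbs' inequality `cross_entropy(p, q) >= entropy(p)`, so the notation $H(P,Q)$ is indeed not symmetric in its arguments.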
My questions boil down to the following:
Where is the random variable that is needed to take the expected value used to define the cross entropy $H(P,Q)=-\text{E}_{P}[\log Q]$?
If I am correct in my assumption that one needs to choose a random variable $X$ to compute the cross entropy, is the notation I used in (1) free of ambiguity?
Your notation $H(X)=-\text{E}_P[\log P(X)]$ is really redundant. In general, if $X$ is a random variable and $g$ is any function, then $E[g(X)]$ is defined without ambiguity; it's not necessary (actually, it makes no sense) to specify "with respect to which variable the expectation is taken".
The confusion might arise here because we are dealing with two kinds of things, random variables and distributions, and using different letters for them. But, in essence, a random variable determines its distribution.
If we understand that $X$ is a rv with distribution $p$, and the same for $Y$ and $q$, then we can write, unambiguously (let me use $\tilde H$ for the cross entropy to distinguish it from the joint entropy):
$$\tilde H(p,q) = - E[ \log q(X)] = - \sum_x p(x) \log q(x) $$
If the above causes some confusion, consider that $\log (q(\cdot))$ is just a function, like $\sin(\cdot)$ or $\sqrt{\cdot}$
It might be clearer and more consistent (but also more verbose) to use $P_X(\cdot)$ and $P_Y(\cdot)$ to denote the distributions of the random variables $X,Y$.
$$\tilde H(X,Y) = - E[ \log P_Y(X)] = - \sum_x P_X(x) \log P_Y(x) $$
Notice that the lowercase $x$, used in the sums, is a dummy variable (not a random variable!) and we could also use $u$ instead or any other letter.
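One way to see that $E[\log q(X)]$ needs no subscript is to estimate it by actually sampling $X$: since $X$ has law $p$, averaging $-\log q(X)$ over draws of $X$ converges to the cross entropy. A quick Monte Carlo sketch (the pmfs are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up pmfs p and q on the support {0, 1, 2}
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

# Exact value: -sum_x p(x) log q(x)
exact = -np.sum(p * np.log(q))

# Monte Carlo: draw X ~ p and average -log q(X).
# No subscript needed on E[.]: the law of X (namely p) is already fixed.
samples = rng.choice(3, size=200_000, p=p)
estimate = -np.mean(np.log(q[samples]))

print(exact, estimate)  # agree up to sampling noise
```

The dummy variable $x$ in the sum never appears here: the estimator only ever evaluates $\log q$ at realizations of the random variable $X$.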
Edit: An attempt to clarify.
First: if $X$ is a given random variable, then its probability distribution is given, and the expression $E[X]$ is perfectly well defined, with zero ambiguity: e.g., in the discrete case, $E[X] = \sum_x x P(x)$ where $P(\cdot)$ is the pmf of $X$. Period. It would be nonsensical to specify "with respect to which variable or distribution" the expectation should be taken.
This is also true when some function is applied to the rv (which just produces another rv), as in $E[g(X)]$, or when the variable is multidimensional, as in $E[g(X,Y)]$.
Hence, the notations $E_Q[X]$, $E_Q[g(X)]$ , $E_Q[g(X,Y)]$ (where $Q$ is some distribution) are wrong, they make no sense.
Having said that: sometimes it's not feasible or practical to stick with that notation, where each random variable has a letter like $X$. For example, suppose we want to consider the family of zero-mean Gaussian distributions, parametrized by the standard deviation $\sigma$, and we want to denote by $r(\sigma)$ the respective differential entropy. Letting $\phi(x;\sigma)$ be the Gaussian density, we'd write
$$r(\sigma) = - \int \phi(x;\sigma) \log \phi(x;\sigma) dx$$
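This integral can be checked numerically against the known closed form $r(\sigma) = \tfrac12\log(2\pi e\sigma^2)$; a minimal sketch using a plain Riemann sum (the truncation to $[-10\sigma, 10\sigma]$ and the grid size are my own choices):

```python
import numpy as np

def phi(x, sigma):
    """Zero-mean Gaussian density with standard deviation sigma."""
    return np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def r(sigma, n=200_001):
    """r(sigma) = -integral of phi(x; sigma) log phi(x; sigma) dx,
    approximated by a Riemann sum on [-10 sigma, 10 sigma]
    (the tails beyond that are negligible)."""
    x = np.linspace(-10 * sigma, 10 * sigma, n)
    dx = x[1] - x[0]
    return -np.sum(phi(x, sigma) * np.log(phi(x, sigma))) * dx

sigma = 2.0
print(r(sigma), 0.5 * np.log(2 * np.pi * np.e * sigma**2))
```

Note that the grid variable `x` never survives into the result, mirroring the point below that $x$ is only a dummy integration variable.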
Notice that here $x$ is not a random variable; it is just a dummy integration variable, and we could replace it with any other letter. Now, surely, this is an expectation, and so we might want to write something like
$$r(\sigma) = - E[\log \phi(x;\sigma)] \tag2$$
... but this is not right, because the argument of $E[\cdot]$ is not a random variable! We might instead write
$$r(\sigma) = - E[\log \phi(X_{\sigma};\sigma)] \tag3$$
adding the definition: "$X_{\sigma}$ is a rv that follows the distribution $N(0,\sigma^2)$", but this is rather ugly. Hence we often abuse the notation in the following way:
$$r(\sigma) = - E_{\phi(x;\sigma)}[\log \phi(x;\sigma)] \tag4$$
But, be careful: here the $E[\cdot]$ notation must be understood not in the probabilistic setting but merely as a functional operator; here random variables are not involved, only functions (some of them densities or distributions). That is
$$E_{f(x)}[g(x)] \triangleq \int f(x) g(x) dx \tag 5 $$
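Read this way, $E_{f(x)}[\cdot]$ is just a higher-order function, and $H(P,Q)=-E_P[\log Q]$ becomes an ordinary (well-defined) expression in it. A sketch, applying the operator to two zero-mean Gaussian densities and comparing with the known closed form $H = \tfrac12\log(2\pi\sigma_q^2) + \sigma_p^2/(2\sigma_q^2)$ (the grid and truncation are my own illustrative choices):

```python
import numpy as np

def E(f, g, grid):
    """Eq. (5): E_{f(x)}[g(x)] = integral of f(x) g(x) dx, treated as a
    plain functional operator approximated on a grid; no random
    variables are involved."""
    dx = grid[1] - grid[0]
    return np.sum(f(grid) * g(grid)) * dx

def gauss(sigma):
    """Zero-mean Gaussian density with standard deviation sigma."""
    return lambda x: np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

p, q = gauss(1.0), gauss(2.0)
grid = np.linspace(-15.0, 15.0, 300_001)

# H(P, Q) = -E_P[log Q]: now a perfectly well-defined expression
H_pq = -E(p, lambda x: np.log(q(x)), grid)

# Closed form for two zero-mean Gaussians
closed = 0.5 * np.log(2 * np.pi * 2.0**2) + 1.0**2 / (2 * 2.0**2)
print(H_pq, closed)
```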
In this setting, the original notation $H(P,Q)=-\text{E}_P[\log Q]$ is, indeed, totally correct.
And as for the plain entropy, it should either be
$$H(X) = -E [\log P(X)] \tag 6$$
(the entropy of a random variable $X$, where $X$ is a rv with distribution $P$) or
$$H(P) = -E_P [\log P] \tag 7$$
(the entropy of a distribution $P$, using the functional expectation operator; no random variables appear). You are mixing both.