How to calculate the entropy of a random vector $X = (C_1,\dots,C_n,D_1,\dots,D_m)$ where the $C_i$ are continuous marginals and the $D_j$ are discrete marginals?


Let $X := (C,D) = (C_1,\dots,C_n,D_1,\dots,D_m)$ be a random vector from a mixed continuous-discrete distribution, meaning that $X$ takes values in

$$\mathbb{R}^n \times \mathbb{N}^m,$$

with $C = (C_1,\dots,C_n) \in \mathbb{R}^n$ and $D = (D_1,\dots,D_m) \in \mathbb{N}^m$.

You can think of $X$ as a row from a generic dataset. Maybe it's storing some data about individuals, and contains some

  • discrete values like occupation, sex, residence, country of birth, etc.
  • and continuous values like date of birth, height, weight, salary, etc.

How to calculate (or estimate) the entropy of $X$, $H(X)$?

If $X$ were a purely discrete random vector, then its entropy would be

$$H_{\text{discrete}}(X) = -\sum_{x \in \text{Dom}(X)} \mathbb{P}(X=x) \log_2(\mathbb{P}(X=x))$$
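As a concrete illustration, the discrete entropy can be estimated from samples with a plug-in estimator that replaces each $\mathbb{P}(X=x)$ with its empirical frequency. (A minimal sketch; the function name and setup are my own, not from the question.)

```python
import numpy as np
from collections import Counter

def discrete_entropy(samples):
    """Plug-in estimate of H(X) in bits from a list of discrete samples.

    Each probability P(X=x) is replaced by the empirical frequency of x.
    Samples may be scalars or tuples (rows of a discrete dataset).
    """
    counts = Counter(samples)
    n = len(samples)
    probs = np.array([c / n for c in counts.values()])
    return float(-np.sum(probs * np.log2(probs)))

# A fair coin has entropy 1 bit.
print(discrete_entropy(["H", "T", "H", "T"]))  # → 1.0
```

Note that the plug-in estimator is biased downward for small samples; more refined estimators (e.g. Miller–Madow correction) exist, but the sketch above shows the basic formula.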

Or if $X$ were a purely continuous random vector, then its (differential) entropy would be

$$H_{\text{continuous}}(X) = -\int_{\text{Dom}(X)} f_X(x) \log_2(f_X(x))dx$$

(Where $f_X(x)$ is the p.d.f. of $X$.)
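For a one-dimensional density with a known formula, this integral can be approximated numerically. (A sketch, assuming the density is negligible outside the chosen interval; the function name is my own.)

```python
import numpy as np

def differential_entropy_1d(pdf, lo, hi, num=100_000):
    """Approximate -∫ f(x) log2 f(x) dx on [lo, hi] by the trapezoid rule.

    Assumes the density is effectively zero outside [lo, hi].
    """
    x = np.linspace(lo, hi, num)
    f = pdf(x)
    integrand = np.where(f > 0, -f * np.log2(f), 0.0)
    return float(np.trapz(integrand, x))

# Standard normal: the closed form is 0.5 * log2(2*pi*e) ≈ 2.047 bits.
gauss = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
print(differential_entropy_1d(gauss, -8, 8))
```

Unlike discrete entropy, differential entropy can be negative (e.g. for a uniform density on an interval shorter than 1), which is one reason the mixed case needs care.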

It's not as simple as "adding up a sum and an integral" (something like $-\sum_{d} \mathbb{P}(D=d) \log_2(\mathbb{P}(D=d)) - \int f_C(x) \log_2(f_C(x))\,dx$), because the dependence between the $C_i$ and the $D_j$ wouldn't be taken into account.

I have found the following study that deals with the case $X \in \mathbb{R} \times \mathbb{N}$, i.e. $n = 1$ and $m = 1$. Is there a study that deals with the general case?

Edit: I'm surprised this doesn't exist in the literature, because databases with such rows are extremely common. In fact, it's hard to find databases that have only discrete or only continuous columns.

Best answer:

Since $X := (C,D) = (C_1,\dots,C_n,D_1,\dots,D_m)$, you can expand $H(X)$ using the chain rule:
$$
H(X) = H(C,D) = H(C \mid D) + H(D).
$$
This expansion, rather than $H(C,D) = H(D \mid C) + H(C)$, is probably the better choice from a computational point of view: since $D$ is discrete, the conditional entropy on the right-hand side can be written as
$$
H(C \mid D) = \sum_{d \in \text{Dom}(D)} \mathbb{P}(D=d)\, H(C \mid D=d).
$$
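A minimal numerical sketch of this decomposition, assuming (purely for illustration, not from the answer) that $C$ is one-dimensional and that $C \mid D=d$ is roughly Gaussian, so each $H(C \mid D=d)$ can use the Gaussian closed form $\tfrac{1}{2}\log_2(2\pi e\,\sigma_d^2)$. All function names here are hypothetical:

```python
import numpy as np
from collections import defaultdict

def mixed_entropy(c_samples, d_samples):
    """Estimate H(X) = H(C|D) + H(D) from paired samples (c_i, d_i).

    H(D) uses the empirical plug-in estimate; H(C|D=d) assumes C|D=d
    is approximately Gaussian and uses 0.5*log2(2*pi*e*var) in bits.
    """
    n = len(d_samples)
    groups = defaultdict(list)          # group the c's by their d value
    for c, d in zip(c_samples, d_samples):
        groups[d].append(c)

    h_d = 0.0          # H(D)
    h_c_given_d = 0.0  # sum over d of P(D=d) * H(C | D=d)
    for d, cs in groups.items():
        p = len(cs) / n
        h_d -= p * np.log2(p)
        var = np.var(cs)
        if var > 0:
            h_c_given_d += p * 0.5 * np.log2(2 * np.pi * np.e * var)
    return h_c_given_d + h_d

# Example: D ~ Bernoulli(1/2), C | D=d ~ N(d, 1).
# True value: H(D) = 1 bit, H(C|D) = 0.5*log2(2*pi*e) ≈ 2.05 bits.
rng = np.random.default_rng(0)
d = rng.integers(0, 2, size=5000)
c = rng.normal(d, 1.0)
print(mixed_entropy(c, d))  # ≈ 3.05 bits
```

For multivariate or non-Gaussian $C$, each per-group term $H(C \mid D=d)$ would instead need a nonparametric differential-entropy estimator (e.g. a nearest-neighbor estimator), but the grouping structure stays the same.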