I'm reading the paper 'MINE: Mutual Information Neural Estimation.' In this paper, mutual information is written as $$ I(X;Z) = \int_{\mathcal{X} \times \mathcal{Z}} \log \frac{d \mathbb{P}_{XZ}}{d (\mathbb{P}_{X} \otimes \mathbb{P}_{Z})} \, d \mathbb{P}_{XZ},$$ and, similarly, the KL divergence is written as $$D_{KL} (\mathbb{P} \,\|\, \mathbb{Q}) = \mathbb{E}_\mathbb{P} \Bigl[ \log \frac{d \mathbb{P}}{d \mathbb{Q}} \Bigr].$$
But I'm more familiar with other notation (e.g., from Wikipedia):
MI for continuous variables,
$$ I(X;Z) = \int_\mathcal{Z} \int_\mathcal{X} p_{(X,Z)} (x,z) \log \Bigl( \frac{p_{(X,Z)}(x,z)}{p_X (x)\, p_Z (z)} \Bigr) \, dx \, dz, $$
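To convince myself the two MI formulas describe the same quantity, I wrote a quick numerical check (my own script, not from the paper): for a bivariate Gaussian with correlation $\rho$, mutual information has the closed form $I(X;Z) = -\tfrac{1}{2}\log(1-\rho^2)$ nats, and the density-based double integral above should reproduce it.

```python
import numpy as np

# Sanity check (mine, not from the paper): for a bivariate Gaussian with
# standard normal marginals and correlation rho,
# I(X;Z) = -0.5 * log(1 - rho^2) in nats.
rho = 0.8
true_mi = -0.5 * np.log(1 - rho**2)

# Approximate the double integral with a Riemann sum on a grid.
xs = np.linspace(-6, 6, 400)
dx = xs[1] - xs[0]
X, Z = np.meshgrid(xs, xs)

def gauss2d(x, z, rho):
    """Joint density of a bivariate Gaussian with correlation rho."""
    norm = 1.0 / (2 * np.pi * np.sqrt(1 - rho**2))
    return norm * np.exp(-(x**2 - 2 * rho * x * z + z**2) / (2 * (1 - rho**2)))

p_xz = gauss2d(X, Z, rho)
p_x = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)   # standard normal marginal

# Integrand: p(x,z) * log( p(x,z) / (p(x) p(z)) )
integrand = p_xz * np.log(p_xz / (p_x[None, :] * p_x[:, None]))
mi_numeric = np.sum(integrand) * dx * dx

print(true_mi, mi_numeric)  # the two values should be close
```

The grid sum matches the closed form to a few decimal places, so the integral form checks out numerically.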
and the KL divergence (here in its discrete form),
$$D_{KL} (P \vert \vert Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}.$$
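Related to my third question below, I also checked numerically (my own script, assuming a small hypothetical discrete distribution) that the sum form $\sum_x P(x)\log\frac{P(x)}{Q(x)}$ and the expectation form $\mathbb{E}_P[\log\frac{P}{Q}]$ give the same value: the expectation can be estimated by sampling $x \sim P$ and averaging $\log\frac{P(x)}{Q(x)}$.

```python
import numpy as np

# My own check (not from the paper): the sum form and the expectation
# form of KL divergence agree for a discrete distribution.
rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # hypothetical P
q = np.array([0.4, 0.4, 0.2])   # hypothetical Q

# Sum form: sum_x P(x) log(P(x)/Q(x))
kl_sum = np.sum(p * np.log(p / q))

# Expectation form: E_P[log(P/Q)], estimated by Monte Carlo with x ~ P
samples = rng.choice(len(p), size=200_000, p=p)
kl_mc = np.mean(np.log(p[samples] / q[samples]))

print(kl_sum, kl_mc)  # the Monte Carlo average converges to the sum
```

With enough samples the Monte Carlo average matches the exact sum, which is exactly why the paper can write $D_{KL}$ as an expectation under $\mathbb{P}$.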
My questions are:
- Why do they use blackboard-bold notation like $\mathbb{P}_X$ instead of $P(x)$?
- And why is it $d \mathbb{P}_X$, not just $\mathbb{P}_X$?
- Why does $\sum_x P(x) \log \frac{P(x)}{Q(x)}$ become an expectation $\mathbb{E}_\mathbb{P}[\cdot]$?

I find the notation in this paper quite confusing. Is there a reason they use this type of notation?