Mutual Information between discrete and continuous variable in deep learning

I am trying to replicate an experiment from "When does label smoothing help?" (NeurIPS 2019). The experiment involves estimating the mutual information between the input image to a neural network, $X$ (a discrete variable), and the difference between the network's output logits for two classes, $Y$ (a continuous variable). The randomness in $Y$ given $X$ comes from data augmentation; the authors approximate the conditional distribution of $Y$ as a Gaussian and estimate its mean and variance from Monte Carlo samples. The approximate formula is given below, and the output should be a value between $0$ and $\log(N)$:

$\begin{aligned} I(X;Y) &= E_{X,Y}\left[\log(p(y|x)) - \log\left(\sum_{x} p(y|x)\right)\right] \text{ and }\\ \hat{I}(X;Y) &= \sum_{x=1}^{N}\left[-\left(f(d(\boldsymbol{z}_{x})) - \mu_{x}\right)^{2}/\left(2\sigma^{2}\right) - \log\left(\sum_{x=1}^{N} e^{-\left(f(d(\boldsymbol{z}_{x})) - \mu_{x}\right)^{2}/\left(2\sigma^{2}\right)}\right)\right] \end{aligned}$

where $\mu_{x}=\sum_{l=1}^{L} f\left(d\left(\boldsymbol{z}_{x}\right)\right) / L$ and $\sigma^{2}=\sum_{x=1}^{N}\left(f\left(d\left(\boldsymbol{z}_{x}\right)\right)-\mu_{x}\right)^{2} / N$; here $L$ is the number of Monte Carlo samples used to compute the empirical mean, and $N$ is the number of training examples used for the mutual information estimate.
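For concreteness, here is a sketch of how I am evaluating the estimator (the function name and the choice of using the first augmentation sample as the "observed" $y$ for each example are mine; the paper leaves that detail ambiguous). Assuming $X$ is uniform over the $N$ examples, so that $p(y) = \frac{1}{N}\sum_{x} p(y|x)$, my reading is that the formula needs a $1/N$ average over examples and a $1/N$ factor inside the log (i.e. a $+\log N$ term), which the version below includes:

```python
import numpy as np

def mi_estimate(y):
    """Estimate I(X;Y) from y[x, l]: the logit difference f(d(z_x)) for
    training example x under Monte Carlo augmentation sample l (shape N x L)."""
    N, L = y.shape
    mu = y.mean(axis=1)                           # mu_x = sum_l f(d(z_x)) / L
    sigma2 = ((y - mu[:, None]) ** 2).mean()      # pooled variance estimate
    f = y[:, 0]                                   # one "observed" sample per x
    log_cond = -(f - mu) ** 2 / (2 * sigma2)      # log p(y|x), up to a constant
    # log p(y) up to the same constant: log( (1/N) sum_x' exp(...) )
    sq = -(f[:, None] - mu[None, :]) ** 2 / (2 * sigma2)
    log_mix = np.logaddexp.reduce(sq, axis=1) - np.log(N)
    # E_{X,Y}[ log p(y|x) - log p(y) ], averaged (not summed) over examples
    return (log_cond - log_mix).mean()
```

With this normalization the estimate is bounded above by $\log N$ (the mixture in `log_mix` always contains the example's own component), which matches the stated range; without it, summing over examples produces large negative values like the ones I observe.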

My questions:

1. I don't really understand how this approximation is derived, and the paper neither explains it nor cites a source for it, so any hints would be appreciated.
2. When I implement the formula as written, the estimated mutual information comes out around $-4000$. I'm not sure the formula is correct: substituting rough ballpark numbers into it does not produce anything in the range $0$ to $\log(N)$. Did the authors forget some normalization constants in the approximate formula?