I would like to measure the entropy of a set of real numbers. I am wondering if I can come up with a reasonable estimate by using the first $n$ moments.
Edit 1: It looks like the InfoGAN network estimates mutual information by way of "variational arguments." What does that mean? I think this gets at my problem, which is how to compute entropy when the data are not discrete. Any thoughts?
There is no analytical solution for estimating the entropy of a continuous random variable from its first $n$ statistical moments, since heterogeneous distributions with the same mean and variance can have different entropies. There are, however, some interesting connections:
Entropy's similarity with variance
The density function $f(x)$ of a variable supported on $[-1,1]$ can be approximated by a series of Legendre polynomials, $G_i(x), i=0,\dots,N$, using the Legendre expansion (similar to a Taylor expansion) as
$$f(x) \approx \alpha_0 G_0(x) + \alpha_1 G_1(x) + \dots + \alpha_N G_N(x),$$
$$G_0(x) = 1, \hspace{1cm} G_1(x) = x, \hspace{1cm} G_2(x) = \frac{1}{2} (3x^2 -1)$$
Rearranging the third polynomial and substituting $G_0(x) = 1$ gives $x^2~=~\frac{1}{3} \Big[2 G_2(x) + G_0(x) \Big]$
Based on these expressions, and using the orthogonality of the Legendre polynomials on $[-1,1]$, namely $\int_{-1}^{1} G_m(x) G_n(x)\, dx = \frac{2}{2n+1}\delta_{mn}$, the variance of a zero-mean variable equals
$$\sigma^2(x) = \int_{-1}^{1} x^2f(x)\, dx \approx \frac{1}{3} \Bigg[\frac{4}{5} \alpha_2 + 2 \alpha_0\Bigg]$$
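As a sanity check, this identity can be verified numerically. Below I use the Epanechnikov density $f(x)=\tfrac{3}{4}(1-x^2)$ on $[-1,1]$ (my own choice for illustration, zero-mean with variance exactly $1/5$):

```python
import numpy as np
from numpy.polynomial import legendre

# Example density on [-1, 1] (Epanechnikov kernel), chosen for illustration;
# it is zero-mean, so E[x^2] equals the variance.
f = lambda x: 0.75 * (1.0 - x**2)

# Gauss-Legendre quadrature nodes/weights for integrals over [-1, 1]
x, w = legendre.leggauss(50)

def alpha(i):
    """Legendre coefficient alpha_i = (2i+1)/2 * integral of f(x) * G_i(x)."""
    Gi = legendre.legval(x, np.eye(i + 1)[i])  # G_i evaluated at the nodes
    return (2 * i + 1) / 2.0 * np.sum(w * f(x) * Gi)

a0, a2 = alpha(0), alpha(2)

var_from_coeffs = (4.0 / 5.0 * a2 + 2.0 * a0) / 3.0  # the identity above
var_direct = np.sum(w * x**2 * f(x))                  # direct quadrature

print(a0, a2)                       # 0.5, -0.5
print(var_from_coeffs, var_direct)  # both ~ 0.2
```

Both routes recover the exact variance $1/5$, since the quadrature is exact for polynomial integrands of this degree.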
while the partial derivative of entropy w.r.t. $\alpha_2$ is
$$\frac{\partial H}{\partial \alpha_2} = - \int_{-1}^{1} G_2(x) \log\Bigg[ \alpha_0 G_0(x) + \alpha_1 G_1(x) + \dots + \alpha_N G_N(x) \Bigg] dx $$
(the extra $-\int G_2(x)\, dx$ term from differentiating $f \log f$ vanishes, since $\int_{-1}^{1} G_2(x)\, dx = 0$).
This expression relates the variation of entropy to the variation of variance, but it also shows that entropy depends on higher-order moments than variance does, making it a closer representation of the true probability distribution.
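The derivative formula can likewise be checked numerically against a finite difference of the entropy. Below I use a linear density $f(x)=(2+x)/4$ on $[-1,1]$ (a hypothetical example, chosen so $f$ stays strictly positive when $\alpha_2$ is perturbed, and normalization is preserved because $\int G_2 = 0$):

```python
import numpy as np
from numpy.polynomial import legendre

x, w = legendre.leggauss(100)  # quadrature nodes/weights on [-1, 1]
G2 = 0.5 * (3.0 * x**2 - 1.0)

# Linear density f(x) = (2 + x)/4, i.e. alpha_0 = 1/2, alpha_1 = 1/4, alpha_2 = 0
base = (2.0 + x) / 4.0

def entropy(a2):
    """Differential entropy of base + a2 * G2 (still integrates to 1)."""
    fx = base + a2 * G2
    return -np.sum(w * fx * np.log(fx))

# Central finite difference of H w.r.t. alpha_2 at alpha_2 = 0
eps = 1e-5
grad_fd = (entropy(eps) - entropy(-eps)) / (2.0 * eps)

# Closed-form expression: dH/dalpha_2 = -integral of G2(x) * log f(x)
grad_formula = -np.sum(w * G2 * np.log(base))

print(grad_fd, grad_formula)  # the two agree closely
```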
Conditional entropy acts as a lower bound on MSE
The conditional differential entropy, $H(X|Y)$, yields a lower bound on the expected squared error of any estimator, derived from the fact that the Gaussian maximizes entropy for a given variance. For any random variable $X$, observation $Y$ and estimator $\hat{X}(Y)$, the following holds:
$$\mathbb{E}\left[\bigl(X - \hat{X}(Y)\bigr)^2\right] \ge \frac{1}{2\pi e}\, e^{2 H(X|Y)}$$
The observation $Y$ carries a certain amount of information about $X$, and the lower bound on the right-hand side tells us how small the error of any predictor can possibly get; if a given predictor attains a higher value on the left-hand side, a different predictor could still work better.
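To illustrate, consider a jointly Gaussian toy model (my assumption for this sketch): $X \sim \mathcal N(0,1)$ and $Y = X + \mathcal N(0,1)$. Then $\mathrm{Var}(X|Y) = 1/2$, $H(X|Y) = \tfrac12\log(2\pi e \cdot \tfrac12)$, and the bound evaluates to exactly $1/2$, attained by the optimal estimator $\mathbb E[X|Y] = Y/2$:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 500_000

# Jointly Gaussian toy model (an assumption for illustration):
# X ~ N(0, 1), Y = X + N(0, 1), so Var(X|Y) = 1/2
x = rng.standard_normal(m)
y = x + rng.standard_normal(m)

# H(X|Y) = 0.5 * log(2*pi*e * Var(X|Y)), so the bound equals Var(X|Y)
h_cond = 0.5 * np.log(2 * np.pi * np.e * 0.5)
lower_bound = np.exp(2 * h_cond) / (2 * np.pi * np.e)  # = 0.5

mse_optimal = np.mean((x - 0.5 * y) ** 2)  # E[X|Y] = Y/2 attains the bound
mse_naive = np.mean((x - y) ** 2)          # \hat X = Y leaves room to improve

print(lower_bound)             # 0.5
print(mse_optimal, mse_naive)  # ~0.5, ~1.0
```

The naive estimator sits well above the bound, signalling that a better predictor exists; the conditional-mean estimator closes the gap.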
In turn, there is a lower bound on mutual information (MI), recalling that $I(X;Y) = H(X)-H(X|Y)$. Fano's inequality covers the discrete case, while similar bounds can be derived for the continuous case. One example would be the following: \begin{align} I(X;Y) &= H(Y)-H(Y|X) \\&= H(Y) + \mathbb E_p \log p(y|x) \\ &\geq H(Y) + \mathbb E_p \log q(y|x) \\ &= H(Y) - \tfrac12 \log \mathbb E_p[(Y-f(X))^2] -\tfrac12\log (2 \pi e) \end{align} In the third line, we use the non-negativity of the KL divergence between $p(y|x)$ and some variational distribution, $q(y|x)$. In the fourth line, we make a specific choice for it, $q(y|x) = \mathcal N\big(f(x),\, \mathbb E_p[(Y-f(X))^2]\big)$.
$\mathbb E_p$ denotes expectation under $p$, and its sample estimate is $\mathbb E_p[ (Y-f(X))^2] \approx \frac{1}{m} \sum_{i=1}^m (Y_i-f(X_i))^2$, which is the mean squared error (MSE). In other words, mutual information is bounded from below by prediction error, which is the MSE for this Gaussian choice of $q$.
The smaller the regression error, the larger the lower bound on mutual information becomes, i.e. higher and closer to the true MI. Different choices of the variational distribution $q$ yield different bounds.
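As a minimal sketch of this variational bound, reusing the same toy Gaussian pair as an assumption ($X \sim \mathcal N(0,1)$, $Y = X + \mathcal N(0,1)$, for which the true MI is $\tfrac12\log 2 \approx 0.347$), one can fit a least-squares regressor $f$ and plug its MSE into the bound:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 500_000

# Toy jointly Gaussian pair (an assumption for illustration):
# X ~ N(0, 1), Y = X + N(0, 1); true MI = 0.5 * log(2)
x = rng.standard_normal(m)
y = x + rng.standard_normal(m)

# Least-squares fit f(x) = beta * x (the optimal predictor here is linear)
beta = np.sum(x * y) / np.sum(x * x)
mse = np.mean((y - beta * x) ** 2)

# H(Y) in closed form, since Y is Gaussian (variance ~ 2)
h_y = 0.5 * np.log(2 * np.pi * np.e * np.var(y))

# Variational lower bound: I(X;Y) >= H(Y) - 0.5*log(MSE) - 0.5*log(2*pi*e)
mi_bound = h_y - 0.5 * np.log(mse) - 0.5 * np.log(2 * np.pi * np.e)
mi_true = 0.5 * np.log(2.0)

print(mi_bound, mi_true)  # ~0.347 each
```

The bound is essentially tight here because the Gaussian $q$ matches the true conditional $p(y|x)$; with a misspecified $q$ or a worse regressor, the bound would sit strictly below the true MI.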