How to calculate entropy from a set of samples?


The entropy (expected information content) of a discrete random variable is defined as:

$$ H(X) = \sum_{i} {\mathrm{P}(x_i)\,\mathrm{I}(x_i)} = -\sum_{i} {\mathrm{P}(x_i) \log_b \mathrm{P}(x_i)} $$

This allows one to calculate the entropy of a random variable given its probability distribution.

But what if I have a set of scalar samples and I want to calculate their entropy? In this case the probability density function is not available, but perhaps there is a formula that gives an approximation (analogous to the sample mean)? Does it have a name?


The most natural (and almost trivial) way to estimate (not calculate) the probabilities is just counting: $$\hat{p_i}=\frac{n_i}{N}$$ where $p_i$ is the probability of symbol $i$, $\hat{p_i}$ is its estimator, $n_i$ is the number of occurrences of symbol $i$, and $N$ is the total number of samples. Then you plug this estimator into the entropy formula.
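A minimal sketch of this plug-in (counting) estimator in Python; the function name `plugin_entropy` is my own choice, not standard:

```python
from collections import Counter
import math

def plugin_entropy(samples, base=2):
    """Plug-in entropy estimate: p_i is approximated by n_i / N,
    then substituted into H = -sum p_i * log_b(p_i)."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())

# Two equally frequent symbols give 1 bit, as expected for a fair coin.
print(plugin_entropy(["H", "T", "H", "T", "H", "T"]))  # → 1.0
```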

However, this might not be a fair estimator of the entropy rate of your source, because it does not take into account the dependencies between successive symbols. It only makes sense if the source emits independent symbols - or if you are only interested in the marginal entropies (and provided that your source is stationary - ergodic, actually).