Estimating entropy from a set of measurements


Given a set of measurements, such as $528, 412, 281, 66, 338, 249$, is it possible to compute an estimate on how much entropy each measurement provides?

Just to clarify: I seek an estimate for the amount of unpredictability to expect from a measurement, measured in bits (shannons).

I am experimenting with ways to harvest entropy (for seeding a cryptographically secure pseudorandom number generator), and one of my entropy sources provides imprecise measurements of the time a process takes to complete.

The process is the same each time, the measurement should be approximately the same each time, so variation should (by design) mainly be due to measurement error. The smallest theoretically possible measurement is $0$, the largest is unknown.

My math background is limited, so I would appreciate clear and comprehensible explanations.


BEST ANSWER

As correctly stated in the answer by msm, this interesting problem becomes considerably easier when we can work with a large number of samples. Whether we use an empirical distribution function or a distribution obtained directly from the raw data, once we have enough samples to define a pdf we can rapidly calculate the entropy using the standard formula for Shannon entropy.

However, there are two major issues to consider in this question. The first is that the problem asks for an analysis of entropy on a single, relatively small set of observations taken from a much larger set of possibilities (in this regard, knowing the range within which the numbers are generated would be useful). We are therefore working in an "undersampled" regime. On the other hand, the conventional Shannon entropy is a measure suited to clearly defined probability distributions. Although we can sometimes make assumptions about the underlying distribution to link our sample dataset to some entropy measure, estimating entropy from a single undersampled set of observations is not easy. In practice, we have an unknown discrete distribution sampled by $k$ observations over $N$ different possible outcomes, defined by a probability vector $p=(p_1,p_2,\dots,p_N)$, with $p_i \geq 0$ and $\sum p_i=1$. Because in most cases the probability vector is unknown, the classical Shannon entropy $$H(p)=-\sum_{i=1}^{N} p_i \log p_i$$ cannot be used directly, and we have to estimate $H(p)$ from our dataset of size $k$.
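Before turning to estimators, note that when the probability vector $p$ is fully known, the formula above is trivial to evaluate. A minimal Python sketch (the function name is illustrative, not a standard API):

```python
import math

def shannon_entropy(p, base=2):
    """Shannon entropy of a fully specified probability vector, in bits by default.
    Terms with p_i = 0 contribute nothing, by the convention 0 log 0 = 0."""
    assert abs(sum(p) - 1.0) < 1e-9, "probabilities must sum to 1"
    return -sum(p_i * math.log(p_i, base) for p_i in p if p_i > 0)

# A uniform distribution over 4 outcomes carries exactly 2 bits;
# a skewed one carries less.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))      # 2.0
print(shannon_entropy([0.5, 0.25, 0.125, 0.125]))     # 1.75
```

The difficulty discussed below is precisely that in the undersampled regime we do not know $p$, so we cannot simply call a formula like this.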

This is why the typical approach to entropy in undersampled sets of observations is based on entropy estimators: surrogate measures of entropy that aim to overcome the drawbacks arising from the small size of the dataset. For example, a very basic (and rarely used) estimator is the so-called naive plugin (NP) estimator, which uses the frequency estimates of the discrete probabilities to calculate the following surrogate of entropy:

$$\hat{H}(p)=-\sum_{i=1}^{N} \hat{p}_i \log \hat{p}_i$$

where $\hat{p}_i$ is the maximum likelihood estimate of each probability $p_i$, calculated as the ratio between the frequency of outcome $i$ (i.e. the histogram of the outcomes) and the total number of observations $k$ (terms with $\hat{p}_i=0$ contribute $0$). It can be shown that such an estimator largely underestimates $H(p)$.
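For concreteness, here is a minimal sketch of the NP estimator (illustrative names, not a standard API), applied to the six measurements from the question. Since all six values are distinct, each $\hat{p}_i = 1/6$ and the estimate saturates at $\log_2 6 \approx 2.585$ bits, which illustrates how misleading the plugin value can be in the undersampled regime:

```python
import math
from collections import Counter

def naive_plugin_entropy(samples, base=2):
    """Naive plugin (maximum-likelihood) entropy estimate, in bits by default."""
    k = len(samples)
    counts = Counter(samples)  # histogram of the outcomes
    return -sum((c / k) * math.log(c / k, base) for c in counts.values())

samples = [528, 412, 281, 66, 338, 249]
print(naive_plugin_entropy(samples))  # log2(6) ~= 2.585
```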

A number of other estimators have been proposed to improve on the NP estimator $\hat{H}(p)$. For instance, a rather old approach is the Miller adjustment, in which a slight increase in accuracy is obtained by adding to $\hat{H}(p)$ a constant offset equal to $(N-1)/(2k)$. Clearly this correction is still rough, because it only takes into account the sample size and the number of outcomes, not their distribution. A more robust modification of the NP estimator can be obtained using the classical jackknife resampling approach, commonly used to assess the bias and variance of many types of estimators. The jackknife-corrected version of the NP estimator for a dataset of $k$ observations is

$$\hat {H}_{J}(p)= k \hat {H}(p) - (k-1) \tilde {H}(p) $$

where $\tilde{H}(p)$ is the average of the $k$ NP estimates, each obtained by excluding a single different observation. Other, more complex robust variants of the NP estimator can be obtained using procedures based on analytic continuation. You can find additional details on this issue here.
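A minimal sketch of both corrections (illustrative names, not a standard API; note the Miller offset $(N-1)/(2k)$ is in nats, so it is rescaled to the chosen log base):

```python
import math
from collections import Counter

def np_entropy(samples, base=2):
    """Naive plugin entropy estimate."""
    k = len(samples)
    return -sum((c / k) * math.log(c / k, base) for c in Counter(samples).values())

def miller_entropy(samples, base=2):
    """Miller adjustment: plugin value plus (N - 1)/(2k), rescaled from nats,
    with N taken as the number of distinct observed outcomes."""
    k, n_outcomes = len(samples), len(set(samples))
    return np_entropy(samples, base) + (n_outcomes - 1) / (2 * k * math.log(base))

def jackknife_entropy(samples, base=2):
    """Jackknife-corrected plugin estimate: k*H_hat - (k-1)*H_tilde."""
    k = len(samples)
    h_hat = np_entropy(samples, base)
    # H_tilde: average plugin entropy over the k leave-one-out subsamples
    h_tilde = sum(np_entropy(samples[:i] + samples[i + 1:], base)
                  for i in range(k)) / k
    return k * h_hat - (k - 1) * h_tilde

samples = [528, 412, 281, 66, 338, 249]
print(jackknife_entropy(samples))  # ~3.90 bits: the correction pushes the
                                   # biased plugin value (~2.585) upward
```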

More recently, a number of other estimators based on different arguments have been proposed. Among these, the most commonly used for discrete distributions are the Nemenman-Shafee-Bialek (NSB), the centered Dirichlet mixture, the Pitman-Yor mixture, and the Dirichlet process mixture. These are Bayesian estimators, and therefore hinge on explicitly defined probabilistic assumptions. Similarly, non-Bayesian measures have been suggested, such as the coverage-adjusted estimator, the best upper bound, and the James-Stein estimator. It should be highlighted that there is no unbiased estimator in this context, and that the convergence rate of different estimators can vary considerably, in some cases being arbitrarily slow. However, for the specific question of the OP, which is based on a discrete distribution with finite range, a reasonable choice could be the NSB estimator, which uses an approximately flat prior distribution over the values of the entropy, built as a mixture of symmetric Dirichlet distributions. This estimator shows rapid convergence to the entropy and good performance in terms of robustness and bias. You can find more details on the underlying theory here. Very useful online applications and tools for the calculation of NSB entropy can be found here.

The second issue in this question is that the problem (if I understood correctly) seems to focus on the amount of entropy contributed by each single observation, rather than on the entropy of the whole dataset. While the contribution of each observation is easy to determine in conventional Shannon entropy calculations, this is more challenging for other estimators. A typical approach to simplify this problem, commonly used in many other statistical fields, is to calculate the entropy estimator for the dataset with the observation of interest removed, and then compare it with the entropy estimator for the whole dataset. The difference can be used as a measure of the entropy contribution of that specific observation. Applying this approach with the NSB estimator, or alternatively with a relatively robust NP-related estimator (e.g., the jackknife-corrected one), might be a good way to answer the specific question reported in the OP.
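A minimal sketch of this leave-one-out attribution (illustrative names, not a standard API), using the naive plugin estimator as a stand-in for whichever estimator one prefers:

```python
import math
from collections import Counter

def np_entropy(samples, base=2):
    """Naive plugin entropy estimate."""
    k = len(samples)
    return -sum((c / k) * math.log(c / k, base) for c in Counter(samples).values())

def entropy_contribution(samples, index, estimator=np_entropy):
    """Entropy attributed to one observation: the estimator on the full
    dataset minus the estimator on the dataset with that observation removed.
    Any estimator with the same signature (jackknife-corrected, NSB, ...)
    can be swapped in via the `estimator` argument."""
    reduced = samples[:index] + samples[index + 1:]
    return estimator(samples) - estimator(reduced)

samples = [528, 412, 281, 66, 338, 249]
for i, x in enumerate(samples):
    print(x, entropy_contribution(samples, i))
```

With all six values distinct, each observation contributes the same amount under the plugin estimator, namely $\log_2 6 - \log_2 5 \approx 0.263$ bits; a more robust estimator would generally assign different contributions.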

ANSWER

Based on the comments, the problem is formulated as:

We are given $n$ samples of a stochastic process $X(t)$, where the random variables $X_i=X(t=t_i)$ are independent and identically distributed, and the entropy of the random variables $X_i$ is desired.

The problem becomes tractable once we assume all $X_i$ have the same distribution: it boils down to estimating that single unknown distribution, because every sample can be treated as a draw from it.

Just a few samples, without any prior knowledge of the source, won't help very much. If we could assume a distribution that the source follows, then the samples could be used to estimate its parameters.

In this particular case, I think the standard approach is to build an empirical distribution function. For that to work, we need a large number of samples. Let's assume we have $n$ samples of a random variable $X$ whose cdf is unknown and denoted by $F(x)$. The idea is to estimate $F(x)$ by $\hat{F}_n(x)$, looking only at the $n$ available samples:

$$\hat{F}_n(x)=\frac{1}{n}(\text{number of samples that are less than or equal to }x)$$

As a result of LLN, we have $\hat{F}_n(x)\to F(x)$ almost surely, as $n$ tends to infinity.
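A minimal sketch of the empirical cdf (illustrative names, not a standard API), applied to the six measurements from the question; in practice many more samples would be needed for $\hat{F}_n$ to be close to $F$:

```python
import bisect

def empirical_cdf(samples):
    """Return F_hat_n(x): the fraction of samples less than or equal to x."""
    n = len(samples)
    sorted_samples = sorted(samples)
    def F_hat(x):
        # bisect_right counts how many sorted samples are <= x
        return bisect.bisect_right(sorted_samples, x) / n
    return F_hat

F = empirical_cdf([528, 412, 281, 66, 338, 249])
print(F(300))   # 3 of the 6 samples are <= 300, so 0.5
print(F(1000))  # all samples are <= 1000, so 1.0
```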

From the empirical cdf, we can derive an empirical pmf $p(x)$ and calculate the entropy

$$H=-\sum_x{p(x)\log(p(x))}$$

But this is actually the entropy of all random variables $X_i$ since they all share the same pmf.
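As a sanity check of this pipeline, here is a minimal sketch (illustrative names, not a standard API) that builds an empirical pmf from a large, deterministically constructed sample of a two-outcome source and recovers the source's known entropy:

```python
import math
from collections import Counter

def empirical_pmf(samples):
    """Empirical pmf: p(x) = (count of x) / n."""
    n = len(samples)
    return {x: c / n for x, c in Counter(samples).items()}

def entropy_from_pmf(pmf, base=2):
    """Shannon entropy of a pmf given as a dict, in bits by default."""
    return -sum(p * math.log(p, base) for p in pmf.values() if p > 0)

# 1000 samples from a 90/10 source: the empirical pmf matches the true
# one exactly here, so the entropy comes out to ~0.469 bits, the
# entropy of a Bernoulli(0.1) variable.
samples = [0] * 900 + [1] * 100
print(entropy_from_pmf(empirical_pmf(samples)))  # ~0.469
```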