Information content of a categorical variable


The information content of an outcome of a variable is $h(x_i)=-\log_2 (p(x_i))$. I am interested in using this concept to provide more insight on the differences in the distributions of two independent variables.

I am aware of entropy and relative entropy for comparing the expected information content of these variables, and I'm using these as needed. However, I am looking for a metric that captures the shape of the distributions, the kind of shape a histogram would show graphically. Is there any reason that I can't take the sum of the information content values over the various outcomes of a variable? Effectively, calculating: $$h(X)=\sum_i -\log_2 p(x_i)$$

These summed information content values give a measure of the shape of the distribution.
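To make the calculation concrete, here is a minimal Python sketch of what I have in mind; the function name `summed_information` and the two example distributions are just illustrative:

```python
import math

def summed_information(p, base=2):
    """Sum of -log_base p(x_i) over the outcomes (the quantity proposed above).

    Zero-probability outcomes are skipped here, which is a choice I'd have to
    justify, since the formula itself would assign them infinite content.
    """
    return sum(-math.log(pi, base) for pi in p if pi > 0)

# Two made-up categorical distributions over the same four outcomes
p = [0.25, 0.25, 0.25, 0.25]   # uniform
q = [0.70, 0.10, 0.10, 0.10]   # skewed

print(summed_information(p))   # 8.0   (= 4 * log2(4))
print(summed_information(q))   # ~10.48
```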

I've tried looking for examples of people using this metric, but have come up empty. I'm not sure what it would be called -- presumably just the information content of the variable. I can't see any mathematical reason why this can't be done, but I thought someone here might be able to fill me in if there's something I'm missing.


There are 2 best solutions below


You'd need to justify why you believe that this value is useful, and why it gives "a measure of the shape of the distribution."

Anyway, if the alphabet size is $n$ we have

$$\begin{align} \sum_x -\log p(x) &= \sum_x \log \frac{1}{p(x)}\\ &= \sum_x \left(\log \frac{1/n}{p(x)} + \log n\right) \\ &= n \log n + n\sum_x \frac{1}{n}\log \frac{1/n}{p(x)} \\ &= n \left(\log n + \mathrm{KL}(u \,\|\, p)\right) \end{align}$$

hence the value depends directly on the KL divergence (relative entropy) between the uniform distribution $u$ on the alphabet and the given distribution $p$.
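As a quick numerical sanity check of this identity (a sketch in Python with a made-up distribution $p$ over $n = 4$ symbols, using natural logs):

```python
import math

p = [0.5, 0.25, 0.125, 0.125]
n = len(p)

lhs = sum(-math.log(pi) for pi in p)                         # sum_x -log p(x)
kl_u_p = sum((1 / n) * math.log((1 / n) / pi) for pi in p)   # KL(u || p)
rhs = n * (math.log(n) + kl_u_p)

print(lhs, rhs)   # both ~6.238, as the identity predicts
```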

Furthermore, for a well-behaved finite-entropy distribution with infinitely many outcomes, such as the geometric ($p(x_i) = 2^{-i}$), the value is infinite: each term contributes $-\log_2 2^{-i} = i$, and $\sum_i i$ diverges, which is not very nice.
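To illustrate (a small sketch, assuming the geometric example above with $p(x_i) = 2^{-i}$): the partial sums grow without bound, even though the entropy of this distribution is a finite 2 bits.

```python
import math

# Partial sums of -log2 p(x_i) for p(x_i) = 2**(-i): each term equals i,
# so the running total is n*(n+1)/2 and diverges as more outcomes are included.
for n_terms in (10, 100, 1000):
    total = sum(-math.log2(2.0 ** -i) for i in range(1, n_terms + 1))
    print(n_terms, total)   # 55, 5050, 500500
```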


Maybe the Shannon entropy, i.e. the probability-weighted average of the information content of the outcomes, $$H(X) = \sum_i p(x_i)\,h(x_i) = -\sum_i p(x_i)\log_2 p(x_i),$$ would be useful. It measures the amount of uncertainty associated with a distribution, so it serves as a proxy for its "shape". For a finite alphabet, the uniform distribution has the highest entropy, while a Dirac (point-mass) distribution has an entropy of zero, i.e. the lowest possible.
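A small Python sketch of that comparison (the distributions are made up; base 2 gives entropy in bits):

```python
import math

def entropy(p, base=2):
    """Shannon entropy: the probability-weighted average of -log p(x),
    skipping zero-probability outcomes (their contribution tends to 0)."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits   (uniform: maximal for 4 outcomes)
print(entropy([0.70, 0.10, 0.10, 0.10]))  # ~1.36 bits (skewed: in between)
print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0 bits   (Dirac: minimal)
```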