I have a categorical random variable $S$ that can take on 3 values in $\{A,B,C\}$, with PMF $p_S:\{A,B,C\} \rightarrow [0,1]$. We can describe the PMF using 3 numbers $$(p(S=A),\; p(S=B),\; 1-p(S=A)-p(S=B))$$ Suppose $S$ depends on some other random variable $\theta$, and that we can evaluate $p(S|\theta)$. We can then consider $$ p(S) = E_{\theta\sim p(\theta)}[p(S|\theta)]\\ =\int p(\theta)\, p(S|\theta)\, d\theta\\ \approx \frac{1}{M}\sum_{i=1}^M p(S| \theta_i), \quad \theta_i \sim p(\theta) $$ I have a sample of $\theta$ (of size $M$) that induces a sample of PMFs. For example, in case 1 I have ($M=5$) $$[(0.3,0.3,0.4), (0.3,0.3,0.4), (0.3,0.3,0.4), (0.3,0.3,0.4), (0.3,0.3,0.4)]$$ In case 2 I have (replacing each 0 with 1e-8 before taking logs) $$[(0.0,0.8,0.2), (0.0,0.1,0.9), (0.9, 0.1, 0.0), (0.1, 0.1, 0.8),(0.2, 0.7,0.1)]$$
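As a concrete sketch of the Monte Carlo estimate of $p(S)$, here is a toy setup where $\theta$ itself is a point on the simplex and $p(S|\theta) = \theta$; the Dirichlet prior is an arbitrary hypothetical stand-in for $p(\theta)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: theta lives on the 3-simplex and p(S|theta) = theta,
# with a Dirichlet prior standing in for p(theta).
alpha = np.array([2.0, 3.0, 5.0])
M = 100_000
theta = rng.dirichlet(alpha, size=M)   # M draws theta_i ~ p(theta), shape (M, 3)

p_S = theta.mean(axis=0)               # (1/M) sum_i p(S|theta_i)
print(p_S)  # should approach the Dirichlet mean alpha/alpha.sum() = (0.2, 0.3, 0.5)
```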
My question is: how do I characterize the spread of the sampled distributions? Intuitively, case 1 has no spread while case 2 has some. Below are some practical options and my thoughts on them; I would like help formulating them more rigorously. Preferably, the characterization would be stated in terms of $S$ alone, without reference to having $\theta$ samples, since those are more a limitation of our evaluation (we can't evaluate $p(S)$ directly) and an implementation detail.
**Option 1:** variance of the entropy of each sampled distribution. We introduce another random variable $X = H(S|\theta)$ and want $Var(X)$, estimated as the sample variance of the $H(S|\theta_i)$ (entropy in nats).
- case 1 $[H(S|\theta_i)] = [1.08889998, 1.08889998, 1.08889998, 1.08889998, 1.08889998]$, $Var(H(S|\theta)) = 0$
- case 2 $[H(S|\theta_i)] = [0.50040262, 0.32508318, 0.32508318, 0.63903189, 0.80181857]$, $Var(H(S|\theta)) = 0.034$
**Option 2:** full entropy of $S$ $$H(S) = -\sum_{s \in \{A,B,C\}} p(S=s) \log p(S=s)\\ \approx -\sum_{s \in \{A,B,C\}} \left[\frac{1}{M}\sum_{i=1}^M p(s| \theta_i)\right] \log\left[\frac{1}{M}\sum_{i=1}^M p(s| \theta_i)\right] $$
- case 1 $\frac{1}{M}\sum_{i=1}^M p(S| \theta_i) = (0.3, 0.3, 0.4)$, $H(S) = 1.0888999753452238$
- case 2 $\frac{1}{M}\sum_{i=1}^M p(S| \theta_i) = (0.24, 0.36, 0.4)$, $H(S) = 1.0768186685608336$
**Option 3:** bounding the entropy of the distribution parameters. The distribution of $S$ is parameterized by $\rho = (\rho_0, \rho_1)$, the 2 numbers that describe its degrees of freedom: $\rho_0 = p(S=A)$, $\rho_1 = p(S=B)$. These parameters are a deterministic function of the random variable $\theta$, so the distribution of $\theta$ induces a distribution $p(\rho)$, whose entropy we can seek to bound. From the sampled $\theta$, we also get a sample of $\rho$. We know that a Gaussian has the maximum differential entropy among all distributions with a given covariance, and its differential entropy is ${\displaystyle {\frac {1}{2}}\ln \det \left(2\pi \mathrm {e} {\boldsymbol {\Sigma }}\right)}$. We evaluate the 2x2 covariance ${\boldsymbol {\Sigma }}$ from the samples and thus have an upper bound on the differential entropy of $p(\rho)$.
- case 1 $\Sigma = \begin{bmatrix}0 & 0\\0 & 0\end{bmatrix}, \det \Sigma = 0$
- case 2 $\Sigma = \begin{bmatrix}0.143 & -0.048\\-0.048 & 0.128\end{bmatrix}, \det \Sigma = 0.016$
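For reference, the Gaussian bound itself (not just $\det \Sigma$) can be evaluated from the sample covariances above. The singular case 1 covariance drives the bound to $-\infty$, consistent with a point mass having zero spread:

```python
import numpy as np

sigma_1 = np.zeros((2, 2))                              # case 1 sample covariance
sigma_2 = np.array([[0.143, -0.048], [-0.048, 0.128]])  # case 2 sample covariance

def gaussian_entropy_bound(sigma):
    """0.5 * ln det(2*pi*e*Sigma): max differential entropy for covariance Sigma."""
    det = np.linalg.det(2 * np.pi * np.e * sigma)
    with np.errstate(divide="ignore"):  # log(0) -> -inf for a singular Sigma
        return 0.5 * np.log(det)

print(gaussian_entropy_bound(sigma_1))  # -inf (degenerate: no spread)
print(gaussian_entropy_bound(sigma_2))  # about 0.77 nats
```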
Options 1 and 3 match my requirements. Option 3 additionally avoids the aliasing issue where $(0,1,0)$ and $(1,0,0)$ have the same entropy despite being different distributions. I will be using Option 3 for now, but are there any existing methods to get what I want?
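The aliasing issue is easy to demonstrate: two different degenerate distributions have identical entropy, while their parameter vectors $\rho$ remain clearly distinct (using the same epsilon-smoothing as in my test code):

```python
import numpy as np

eps = 1e-8
p1 = (np.array([0.0, 1.0, 0.0]) + eps) / (1 + 3 * eps)  # point mass on B
p2 = (np.array([1.0, 0.0, 0.0]) + eps) / (1 + 3 * eps)  # point mass on A

def entropy(p):
    return -np.sum(p * np.log(p))

print(entropy(p1), entropy(p2))  # identical (both ~0): entropy can't tell them apart
print(p1[:2], p2[:2])            # the parameter vectors rho are clearly different
```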
NumPy code for testing:
import numpy as np
a = np.array([(0.3, 0.3, 0.4), (0.3, 0.3, 0.4), (0.3, 0.3, 0.4), (0.3, 0.3, 0.4), (0.3, 0.3, 0.4)])
b = np.array([(0.0, 0.8, 0.2), (0.0, 0.1, 0.9), (0.9, 0.1, 0.0), (0.1, 0.1, 0.8), (0.2, 0.7, 0.1)])
epsilon = 1e-8
b = b + epsilon
b = b / (1 + epsilon * 3)
# entropy of each sampled distribution
entropy_a = -np.sum(np.multiply(a, np.log(a)), axis=1)
entropy_b = -np.sum(np.multiply(b, np.log(b)), axis=1)
print(entropy_a)
print(entropy_b)
# variance of the sampled entropy
print(np.var(entropy_a))
print(np.var(entropy_b))
# full estimated entropy
p_s_a = np.mean(a, axis=0)
p_s_b = np.mean(b, axis=0)
full_entropy_a = -np.sum(np.multiply(p_s_a, np.log(p_s_a)))
full_entropy_b = -np.sum(np.multiply(p_s_b, np.log(p_s_b)))
print(full_entropy_a)
print(full_entropy_b)
# only 2 degrees of freedom needed to parameterize the distribution,
# so keep the first 2 coordinates as rho
rho_a = a[:, :2]
rho_b = b[:, :2]
sigma_a = np.cov(rho_a, rowvar=False)  # np.cov defaults to ddof=1 (sample covariance)
sigma_b = np.cov(rho_b, rowvar=False)
print(np.linalg.det(sigma_a))
print(np.linalg.det(sigma_b))