Proof that entropy of Markov Source is $H(S) = \sum_s P(s) H(s)$


Let $S$ be a Markov source whose distribution $P$ does not depend on the initial distribution $P_0$, and which generates symbols from a finite set $\Omega$. Prove that $$H(S) = \sum_{s \in \text{process states}} P(s)H(s) $$ where $$H(s) = -\sum_{x\in\Omega} P(x\mid s) \log P(x \mid s).$$

My attempt

According to the definition of entropy, for a discrete random variable $X$ with possible outcomes $x_1, \dots, x_n$, we have $H(X) = -\sum_{i=1}^{n} P(x_i)\log P(x_i)$. So in my case, $X$ would be $S$ and $x_1, \dots, x_n$ would be $s_1, \dots, s_n$. Then: $$H(X) = H(S) = -\sum_{i=1}^n P(s_i) \log P(s_i). $$

So I want to show that $ - \log P(s_i) = H(s_i)$, but I don't know how to proceed further. I also looked in some books but didn't find a proof.

Best Answer

A random process $X_1,X_2,\cdots$ is associated with an entropy rate, defined as \begin{eqnarray*} \mathcal{H} &=& \lim_{n \rightarrow \infty} \frac{1}{n}H(X_n, X_{n-1}, \cdots, X_1) \end{eqnarray*} (when the limit exists). If the process is stationary, as it appears to be in your example, then you can use an equivalent definition of the entropy rate: \begin{eqnarray*} \mathcal{H} &=& \lim_{n \rightarrow \infty} H(X_n | X_{n-1}, \cdots, X_1) \\ \end{eqnarray*}

The definition that makes sense to use for a stationary Markov process is the second one. In your example, you appear to be assuming a Markov source of order 1 (the current symbol depends only on the single previous symbol). In that case \begin{eqnarray*} H(X_n | X_{n-1}, \cdots, X_1) = H(X_n | X_{n-1}) = H(X_1|X_0) \end{eqnarray*} where the last equality follows from the stationarity of the process. The formula for the entropy rate that you mentioned then follows directly, since \begin{eqnarray*} H(X_1|X_0) &=& \sum_{x_0} p(x_0) \sum_{x_1} p(x_1|x_0) \log \frac{1}{p(x_1|x_0)} \\ &=&\sum_{x_0} p(x_0) H(X_1 | X_0 = x_0). \end{eqnarray*}

Now, you have one additional condition in your problem, namely that "$P$" does not depend on "$P_0$". It isn't immediately clear what exactly you mean, but most likely you are stating that the stationary distribution of the Markov chain does not depend on the distribution of the initial state. This condition is not strictly needed to obtain the entropy formula above; however, it is very useful when one wants to relate the probability of long sequences to the entropy.
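This independence from the initial distribution can be checked numerically: iterating $p_{n+1} = p_n P$ from two different starting distributions drives both to the same stationary distribution. The chain below is the same illustrative example as above, not something taken from your problem:

```python
# Sketch: for an ergodic chain, p0 P^n converges to the same stationary
# distribution regardless of p0. Transition matrix is a made-up example.

def step(p, P):
    """One step of the chain: distribution p times transition matrix P."""
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]

P = [[0.9, 0.1],
     [0.2, 0.8]]

p_a = [1.0, 0.0]   # start surely in state 0
p_b = [0.0, 1.0]   # start surely in state 1
for _ in range(200):
    p_a = step(p_a, P)
    p_b = step(p_b, P)

print(p_a)  # both approach the stationary distribution (2/3, 1/3)
print(p_b)
```

The convergence rate is governed by the second-largest eigenvalue modulus of $P$ (here $0.7$), so 200 iterations agree to machine precision.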

If the stationary distribution does not depend on the initial conditions, the Markov chain is "ergodic". One consequence of ergodicity is that empirical averages converge to the corresponding expectation with probability 1 as the number of samples being averaged goes to infinity; specifically, for a stationary and ergodic random process and any $f$ with $E[|f(X_0)|] < \infty$,

\begin{eqnarray*} \lim_{n \rightarrow \infty} \frac{1}{n} \sum_{i=1}^{n} f(X_i) = E[f(X_0)] \end{eqnarray*}

with probability 1. If you plug in $f = \log P_{X_1|X_0}$, then you obtain that

\begin{eqnarray*} \lim_{n \rightarrow \infty} \frac{1}{n} \sum_{i=1}^{n} -\log P_{X_1|X_0}(X_i | X_{i-1}) = E[-\log P(X_1|X_{0})] = H(X_1|X_0) \end{eqnarray*} with probability 1. So ergodicity is what makes the entropy rate operationally meaningful: it can be used to estimate the probability of long sequences of symbols, regardless of how the sequence starts.
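The convergence above can be observed by simulation: average $-\log_2 P(X_i \mid X_{i-1})$ along one long sample path and compare it with the exact entropy rate. The chain below is the same hypothetical example used earlier:

```python
import math
import random

# Monte Carlo sketch (assumed 2-state chain, not from the original post):
# the empirical average of -log2 P(X_i | X_{i-1}) along a sample path
# should approach the entropy rate H(X1|X0), by ergodicity.

random.seed(0)
P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [2/3, 1/3]  # stationary distribution of this chain

# Exact entropy rate in bits: H = -sum_s pi(s) sum_x P(x|s) log2 P(x|s).
H = -sum(pi[s] * p * math.log2(p) for s in range(2) for p in P[s])

n = 200_000
x = 0
total = 0.0
for _ in range(n):
    x_next = 0 if random.random() < P[x][0] else 1
    total += -math.log2(P[x][x_next])
    x = x_next

print(total / n)  # close to H (about 0.5533 bits)
```

Equivalently, a typical length-$n$ sequence has probability roughly $2^{-nH}$, which is the asymptotic equipartition property for ergodic sources.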

For further reference, take a look at the classic book Elements of Information Theory by Cover and Thomas.