Heuristic explanation of channel capacity for discrete noiseless channels and binary symmetric channel


I am reading the part of Cover and Thomas's book on information theory that deals with channel capacity. I follow the definitions and the mathematical details, but I'm not sure I get the idea behind the definition of channel capacity.

He defines this quantity as

$$C= \max_{p(x)} I(X,Y) $$

where $p(x)$ is the input probability distribution and $I$ the mutual information. So, what is the idea behind this?

As I understand it, the mutual information measures how much you know about $X$ given $Y$, or vice versa: basically, how much information one variable gives you about the other. So the idea would be to take the input distribution for $X$ that maximizes this quantity, because this means that by knowing $Y$ you learn as much as possible about $X$. Later I'll see how this can be used to transmit codes without errors, but for now I'm interested in understanding the basic concept and in applying it to simple exercises.

So, for example, let's take the binary symmetric channel as an exercise. Suppose each bit is transmitted correctly with probability $1-p$ (0 maps to 0 and 1 maps to 1) and flipped with probability $p$. Then the transition matrix $P=p(y|x)$ will be given by

$$ P = \begin{bmatrix} 1-p & p \\ p & 1-p \end{bmatrix}$$

Consider then

$$I(X,Y)= H(Y) - H(Y|X)= -p(Y=0) \log_2 (p(Y=0)) - p(Y=1) \log_2 (p(Y=1)) - H(Y|X)$$

Now I know that, given $p(x)$ and the transition matrix $P$, I can get $p(y)$ by multiplying the two, but how do I maximize over $p(x)$? In the book they do it very quickly, I guess because they knew the result in advance. Also, they say the result is $1 - H(p)$, but what do they mean by that?
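Just to show what I mean by "multiplying the two": the output distribution is a row vector times the transition matrix. A quick sketch (the numbers are an arbitrary example of mine, not from the book):

```python
import numpy as np

p = 0.1  # crossover (error) probability of the BSC
P = np.array([[1 - p, p],
              [p, 1 - p]])  # transition matrix p(y|x)

px = np.array([0.3, 0.7])  # some input distribution: P[X=0], P[X=1]
py = px @ P                # output distribution p(y)
```

Here `py[0]` is $P[Y=0] = P[X=0](1-p) + P[X=1]\,p$, and the two entries sum to 1 as they should.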


This answer just summarizes my comments: $H(p)$ denotes the entropy of a Bernoulli-$p$ variable $$ H(p) = p \log_2(1/p) + (1-p)\log_2(1/(1-p))$$ Observe that $H(p)$ is symmetric about $p=1/2$ and $H(0)=H(1)=0, H(1/2)=1$.

For the BSC with error probability $p$ we have $H(Y|X)=H(p)$ regardless of the value of $P[X=0]$: \begin{align} H(Y|X) &= -\sum_{x=0}^1\sum_{y=0}^1 P[X=x]P[Y=y|X=x]\log(P[Y=y|X=x])\\ &=\sum_{x=0}^1P[X=x]\underbrace{\left[-\sum_{y=0}^1 P[Y=y|X=x]\log(P[Y=y|X=x])\right]}_{-p\log(p)-(1-p)\log(1-p) = H(p)}\\ \end{align} Then $$I(X,Y) = H(Y) - H(Y|X)=H(Y)-H(p)$$ The largest $H(Y)$ can be is 1, and this can be achieved by $P[X=0]=P[X=1]=1/2$ (which means $P[Y=0]=P[Y=1]=1/2$). So the uniform distribution on $X$ maximizes the mutual information $I(X,Y)$, with max value $1-H(p)$.

For intuition, we see that if $p=0$ or $p=1$ we get a perfect deterministic channel that can send at capacity $1-H(p)=1$ bit per channel use.

If $p=1/2$ then the output of the channel is independent of the input and the capacity is $1-H(1/2)=0$.
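Putting the edge cases together, a tiny sketch (again just illustrative) of the capacity formula $C = 1 - H(p)$:

```python
import math

def binary_entropy(p):
    """Entropy of a Bernoulli-p variable, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of the BSC, in bits per channel use."""
    return 1 - binary_entropy(p)

# p = 0 or p = 1: deterministic channel, capacity 1 bit per use
# p = 1/2: output independent of input, capacity 0
```

Note that $p=1$ also gives capacity 1: the channel flips every bit deterministically, so the receiver can simply invert the output.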