I'm aware that, by definition, mutual information (MI) is non-negative, and there are two related questions here: (1) and (2).
However, I can think of an example in Physics where it is (or at least it seems to be) negative. So, please, point to where my reasoning is wrong.
First, write the $MI$ as $$ MI(Y;X) = H(Y) - H(Y|X)\ , $$ where $X$ is the input signal and $Y$ is the output signal. The discrete stochastic variables $X$ and $Y$ each jump between their states (indexed by $k$) over time, with different state sets for $X$ and for $Y$.
Say I simulated the stochastic process $Y$ with and without the input $X$. For the sake of the example, consider that $Y$ has 10 states. Say without input, the distribution $\mathcal{P}_0(Y_k)$ that I observed in the simulations is a power law $$ \mathcal{P}_0(Y_k)=\frac{k^{-1}}{\sum_{k=1}^{10}k^{-1}}\ ; $$ whereas with input, I observed a uniform distribution $\mathcal{P}_X(Y_k)$ $$ \mathcal{P}_X(Y_k)=1/10\ . $$
Now, the first term in the $MI$ formula is simply $$ H(Y)=-\sum_{k=1}^{10} \mathcal{P}_0(Y_k) \ln \mathcal{P}_0(Y_k)\approx 1.9938~{\rm nats}\ , $$ and the entropy given the input is (at least I'm assuming it is empirically given by) $$ H(Y|X)=-\sum_{k=1}^{10} \mathcal{P}_X(Y_k) \ln \mathcal{P}_X(Y_k)=\ln 10\approx 2.3026~{\rm nats}\ , $$ which yields $MI(Y;X)\approx-0.3088~{\rm nats}<0$.
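For reference, here is a short numerical check (a sketch in Python/NumPy) reproducing these two numbers; note that the quoted values are natural-log entropies, since $2.3026 = \ln 10$:

```python
import numpy as np

k = np.arange(1, 11)

# Power-law distribution observed without input: P0(Y_k) ∝ 1/k
p0 = (1.0 / k) / np.sum(1.0 / k)

# Uniform distribution observed with input: PX(Y_k) = 1/10
px = np.full(10, 0.1)

H_Y = -np.sum(p0 * np.log(p0))       # ≈ 1.994 nats
H_naive = -np.sum(px * np.log(px))   # = ln(10) ≈ 2.303 nats

print(H_Y, H_naive, H_Y - H_naive)   # difference ≈ -0.309 < 0
```

This reproduces the negative difference, which is exactly the puzzle.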
I suspect the problem is in this last step. In other words, can I really apply this entropy formula to $\mathcal{P}_X(Y_k)$?
I assume I can, since I measured $Y$ for a long time while the input $X$ was being delivered to $Y$. Or can't I? And if I can't, why not, and how should I go about it?
Suppose I have access only to the distribution of the output states $\mathcal{P}_X(Y_k)$, and the input $X$ is a Poisson process. How should $H(Y|X)$ be written?
PS: this example is illustrative, but it is essentially what happens in mean-field simulations of systems in the directed percolation universality class (such as the mean-field contact process, or any mean-field branching process): the distribution of states is a power law as the input tends to zero (at the critical point of the branching-ratio parameter), and becomes uniform as the Poisson input rate increases.
You're not computing the right thing at all.
The conditional entropy of $Y$ given $X$ is defined as $$ H(Y|X) = - \sum_x P_X(x) \sum_y P_{Y|X}(y|x)\log P_{Y|X}(y|x)\ . $$ This is never larger than $H(Y) = - \sum_y P_Y(y) \log P_Y(y)$, where $P_Y$ must be the marginal over $Y$ of the joint law of $(X,Y)$, i.e., $P_Y(y) = \sum_x P_X(x) P_{Y|X}(y|x)$ for each $y$. The inequality $H(Y) \ge H(Y|X)$ is a mathematical certainty, a direct consequence of the concavity of entropy (Jensen's inequality).
In the question, compare your computation of $H(Y|X)$ with the formula above, and I guess you'll see the gap: your $\mathcal{P}_0$ is not the marginal of any joint law whose conditionals are $\mathcal{P}_X$ — it is the output distribution of a *different* experiment (no input at all), so the inequality does not apply. Effectively, you have just described two laws $P^1, P^2$ on a random variable $Y$, and observed that the entropy under one law is larger than the entropy under the other. This is not surprising at all, I hope :).
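To illustrate with a proper joint law, here is a sketch in Python/NumPy. It treats "input off/on" as the two states of $X$ and *assumes* an equal-probability input law $P_X = (1/2, 1/2)$, which is not specified in the question; any other choice of $P_X$ would do. The marginal $P_Y$ is then the mixture of the two conditionals, and the MI comes out non-negative, as it must:

```python
import numpy as np

k = np.arange(1, 11)
p_y_given_x0 = (1.0 / k) / np.sum(1.0 / k)  # Y | X = 0 (input off): power law
p_y_given_x1 = np.full(10, 0.1)             # Y | X = 1 (input on): uniform
p_x = np.array([0.5, 0.5])                  # assumed input law (hypothetical)

# Marginal of Y under the joint law: P_Y(y) = sum_x P_X(x) P_{Y|X}(y|x)
p_y = p_x[0] * p_y_given_x0 + p_x[1] * p_y_given_x1

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability states."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

H_Y = entropy(p_y)
H_Y_given_X = p_x[0] * entropy(p_y_given_x0) + p_x[1] * entropy(p_y_given_x1)
MI = H_Y - H_Y_given_X
print(H_Y, H_Y_given_X, MI)  # H(Y) >= H(Y|X), so MI >= 0
```

The key difference from the computation in the question is that $H(Y)$ is evaluated on the mixture `p_y`, not on the no-input distribution, and $H(Y|X)$ is the $P_X$-weighted average of the conditional entropies.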