The formula I was given for calculating information for a specific stimulus $s_x$ is: $$I(R,s_x) = \sum_i p(r_i|s_x) \log_2{p(r_i|s_x)\over p(r_i)} $$
It was also said that information is always non-negative. My understanding is that the logarithm term represents the information carried by each response, and it is weighted by $p(r_i|s_x)$ and summed over all $i$ to obtain an average value.
My confusion is this... surely we can imagine a scenario where the distribution $p(r_i)$ is wider than $p(r_i|s_x)$ for a given $s_x$. For any response where the conditional probability is less than the unconditional probability, the logarithm term is negative...?
Is the non-negativity of mutual information only guaranteed when averaged across all responses (i.e. when summed over all $r_i$)?
Look up the Kullback–Leibler (KL) divergence (en.wikipedia.org/wiki/Kullback–Leibler_divergence), denoted $D(P\|Q)$, which measures the amount of "difference" between two distributions $P$ and $Q$. Note the proof that it is always non-negative for any two distributions. Your quantity $I(R,s_x)$ can be recognized as the KL divergence between the conditional distribution $p(r_i|s_x)$ and the original distribution $p(r_i)$. So yes, individual terms in the sum can be negative, but the overall sum over all $r_i$ is always non-negative, and it is $0$ if and only if the conditional distribution and the original distribution are the same. No averaging over stimuli is needed for this to hold; the non-negativity already applies to each $s_x$ separately.
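You can verify this numerically with a minimal sketch (the two distributions below are made up for illustration): some per-response terms of the sum are negative, yet the full sum, which is the KL divergence $D\big(p(r|s_x)\,\|\,p(r)\big)$, stays non-negative.

```python
import math

# Hypothetical distributions over 3 responses, chosen for illustration:
p_r = [0.5, 0.3, 0.2]            # marginal p(r_i)
p_r_given_s = [0.8, 0.15, 0.05]  # conditional p(r_i | s_x), narrower than p(r_i)

# Per-response terms: p(r_i|s_x) * log2( p(r_i|s_x) / p(r_i) )
terms = [pc * math.log2(pc / p) for pc, p in zip(p_r_given_s, p_r)]
print(terms)  # the second and third terms are negative

# The full sum is the KL divergence D(p(r|s_x) || p(r)) -- non-negative.
I_sx = sum(terms)
print(I_sx)  # positive despite the negative terms
```

Running this shows two negative terms but a positive total, exactly the situation described in the question.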