Suppose we have two independent random variables X and Y. Intuitively, the mutual information I(X,Y) between the two should be zero, since knowing one tells us nothing about the other.
The math behind this also checks out from the definition of the mutual information (https://en.wikipedia.org/wiki/Mutual_information).
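For discrete variables the definition boils down to

I(X,Y) = sum over all pairs (x,y) of p(x,y) * log[ p(x,y) / (p(x)*p(y)) ]

and under independence p(x,y) = p(x)*p(y), so every term in the sum is log(1) = 0.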
Now let us actually compute it. First, generate two random vectors of length 10 in R:
X=sample(seq(1,100),10)
Y=sample(seq(1000,10000),10)
I got these:
X={3, 35, 93, 13, 90, 89, 34, 97, 49, 82}
Y={7611, 5041, 2612, 4273, 6714, 4391, 1000, 6657, 8736, 2443}
The mutual information can be expressed in terms of the entropies H(X) and H(Y) and the joint entropy H(X,Y):
I(X,Y) = H(X) + H(Y) - H(X,Y)
Moreover
H(X) = -10*[(1/10)*log(1/10)] = log(10)
since each observation occurs only once and thus has an empirical frequency of 1/10. The maximum entropy of a variable taking N distinct values is log(N), so this calculation checks out.
Similarly
H(Y) = log(10)
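As a quick check, this is roughly how I computed the plug-in entropies directly in R (natural logs, so the unit is nats; any sample of 10 distinct values gives the same number):

p_x <- table(X) / length(X)   # empirical frequency of each value of X
p_y <- table(Y) / length(Y)   # same for Y
-sum(p_x * log(p_x))          # 2.302585... = log(10)
-sum(p_y * log(p_y))          # 2.302585... = log(10)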
The joint entropy is similar to the individual entropies, but this time we count the frequencies of pairs. For example, the pair {X=3, Y=7611} occurs only once out of 10 paired observations, so its frequency is 1/10. Therefore:
H(X,Y) = -10*[(1/10)*log(1/10)] = log(10)
since each paired observation occurs only once.
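The same kind of plug-in calculation for the pairs looks something like this (again using the X and Y sampled above):

p_xy <- table(X, Y) / length(X)   # frequency of each (x, y) pair
p_xy <- p_xy[p_xy > 0]            # keep only the pairs that actually occur
-sum(p_xy * log(p_xy))            # 2.302585... = log(10)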
So
I(X,Y) = log(10) + log(10) - log(10) = log(10)
which is clearly non-zero. This is also the result that various R packages (e.g. infotheo) produce.
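For instance, something along these lines reproduces it with infotheo (which, if I read its documentation right, uses the same empirical estimator and reports nats):

library(infotheo)
mutinformation(X, Y)   # about 2.3026, i.e. log(10), not 0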
The question is: where is the mistake in my thinking? Why is I(X,Y) not zero?
Notice how the formula for Mutual Information contains probabilities, not frequencies. A frequency is just an approximation of a probability, and with a sample this small you get very inaccurate approximations, hence the result.
In order to calculate the Mutual Information of a discrete random variable X uniformly distributed over [1,100] and an independent random variable Y uniformly distributed over [1000, 10000], you calculate:
H(X) = -100*[(1/100)*log(1/100)] = log(100)
H(Y) = -9001*[(1/9001)*log(1/9001)] = log(9001)
H(X,Y) = -(900100)*[(1/900100)*log(1/900100)] = log(900100)
I(X,Y) = log(100) + log(9001) - log(900100) = 0
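To see the small-sample effect directly, here is a rough sketch; the supports are shrunk to 1:5 purely to keep the required sample size manageable, plug_in_mi is just a helper name I made up, and the exact numbers will vary from run to run. The plug-in estimate drifts towards zero as the sample grows:

plug_in_mi <- function(x, y) {
  p_x  <- table(x) / length(x)
  p_y  <- table(y) / length(y)
  p_xy <- table(x, y) / length(x)
  p_xy <- p_xy[p_xy > 0]
  # I(X,Y) = H(X) + H(Y) - H(X,Y), all from empirical frequencies
  -sum(p_x * log(p_x)) - sum(p_y * log(p_y)) + sum(p_xy * log(p_xy))
}

for (n in c(10, 100, 10000)) {
  x <- sample(1:5, n, replace = TRUE)
  y <- sample(1:5, n, replace = TRUE)
  print(plug_in_mi(x, y))   # shrinks towards 0 as n grows
}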
What you have actually calculated is the Mutual Information of two discrete random variables with the following joint probability distribution:
p(3, 7611) = 0.1
p(35, 5041) = 0.1
p(93, 2612) = 0.1
p(13, 4273) = 0.1
p(90, 6714) = 0.1
p(89, 4391) = 0.1
p(34, 1000) = 0.1
p(97, 6657) = 0.1
p(49, 8736) = 0.1
p(82, 2443) = 0.1
These variables are not independent: knowing one of the values is enough to determine the other, so H(Y|X) = 0 and I(X,Y) = H(Y) - H(Y|X) = log(10) - 0 = log(10), which is exactly the value you computed. That is why their Mutual Information is not zero.
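If you want to check that last point numerically, a minimal sketch with infotheo (assuming I have its argument order right, i.e. that condentropy(Y, X) gives H(Y|X)) on the original X and Y:

library(infotheo)
condentropy(Y, X)      # H(Y|X) = 0: in this sample, X determines Y
entropy(Y)             # H(Y) = log(10)
mutinformation(X, Y)   # H(Y) - H(Y|X) = log(10)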