I am a biologist, and I have run into a mathematics problem that I am not really sure about.
What I am trying to do is calculate the information gain of a large number of binary splits.
For a binary split, the information gain is $I = H_{p} - p_{c1}H_{c1} - p_{c2}H_{c2}$, where the indices $p$, $c1$ and $c2$ denote the parent, child 1 and child 2 respectively, $H$ is the Shannon entropy, and $p_{c1}$, $p_{c2}$ are the fractions of the parent set that fall into $c1$ and $c2$.
Now, if I am correct, $H$ is always non-negative and the probabilities are always non-negative, so the information gain cannot exceed $H_{p}$, and it depends strongly on the composition of the set.
In my case I want to compare different sets with different values in them using information gain, so I thought I would normalize it, so that a perfect split (where the entropy of both child nodes is 0) gives 1, and a totally random split (child entropy of 1, equal to the entropy of a balanced parent) gives 0.
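To make the idea concrete, here is a minimal sketch of the gain and the normalization I have in mind, using standard weighted child entropies; the counts are made-up:

```python
from math import log2

def entropy(counts):
    """Shannon entropy (bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Made-up example: a balanced parent of 8 real and 8 decoy items,
# split by a cutoff into children of sizes 10 and 6.
parent = [8, 8]     # [reals, decoys] in the parent set
child1 = [7, 3]     # [reals, decoys] above the cutoff
child2 = [1, 5]     # [reals, decoys] below the cutoff

n = sum(parent)
w1, w2 = sum(child1) / n, sum(child2) / n   # child weights
gain = entropy(parent) - w1 * entropy(child1) - w2 * entropy(child2)

# Proposed normalization: 1 for a perfect split, 0 for a useless one.
normalized = gain / entropy(parent)

print(gain, normalized)
```

With a balanced parent ($H_{p} = 1$ bit) the normalized value equals the raw gain; for unbalanced parents it rescales the gain into $[0, 1]$.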
Let's assume I have my "real" values $r$ and "decoy" values $d$ in my initial parent set. For any cutoff value $c$, I normalized the number of values above the cutoff and the number below it by the total count for the entropy calculation, as follows (the subscript $+$ denotes the part of the real or decoy set with values above $c$, and $-$ the part with values below $c$):
$$ H_{+} = -\frac{d_{+}}{d_{+}+d_{-}} \log_{2}\!\left(\frac{d_{+}}{d_{+}+d_{-}}\right) - \frac{r_{+}}{r_{+}+r_{-}} \log_{2}\!\left(\frac{r_{+}}{r_{+}+r_{-}}\right) $$
and
$$ H_{-} = -\frac{d_{-}}{d_{+}+d_{-}} \log_{2}\!\left(\frac{d_{-}}{d_{+}+d_{-}}\right) - \frac{r_{-}}{r_{+}+r_{-}} \log_{2}\!\left(\frac{r_{-}}{r_{+}+r_{-}}\right) $$
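In code, the two formulas above read as follows (a minimal sketch with made-up counts, implementing exactly what I wrote, i.e. each term normalized within its own real or decoy set rather than within the child node):

```python
from math import log2

def plogp(p):
    """-p * log2(p), with the 0 * log(0) = 0 convention."""
    return 0.0 if p == 0 else -p * log2(p)

# Made-up counts: decoys/reals above (+) and below (-) the cutoff c.
d_plus, d_minus = 3, 7
r_plus, r_minus = 8, 2

# H+ and H- as defined above: each fraction is taken within
# the decoy set or the real set separately.
H_plus  = plogp(d_plus  / (d_plus + d_minus)) + plogp(r_plus  / (r_plus + r_minus))
H_minus = plogp(d_minus / (d_plus + d_minus)) + plogp(r_minus / (r_plus + r_minus))

print(H_plus, H_minus)
```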
Is this the correct way?