This might not strictly be a mathematics question, but it does relate to information theory.
My question is:
Does the information-gain criterion in decision-tree machine learning favor a high-entropy attribute or a low-entropy one?
The source of my confusion is the definition of Shannon's entropy function:

H = -SUM(pi * log2(pi))

Note the leading MINUS sign. If that minus is part of the formula, then surely

gain = Hbefore - Hafter

actually means

gain = Hbefore + Hafter

??... Or have people just forgotten about the minus sign??
The minus sign is NOT a subtraction; it is part of the definition of entropy. It is there because we are taking logarithms of probabilities, which lie between 0 and 1, so their logarithms are negative (or zero).
Try it on a calculator: what is the base-2 logarithm of 0.5? That's right, it is -1. So for a random variable that is 0 with 50% probability and 1 with 50% probability, the sum of pi*log2(pi) comes out to -1, and we negate it so that the entropy is a non-negative 1 bit. The gain formula gain = Hbefore - Hafter is then a genuine subtraction of two non-negative quantities.
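To make this concrete, here is a minimal Python sketch of entropy and information gain (the split in the example is made up for illustration):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), with 0*log2(0) treated as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin: log2(0.5) = -1 for each outcome, so the leading minus
# turns the sum of -1 into an entropy of +1 bit.
h_before = entropy([0.5, 0.5])  # 1.0

# Hypothetical attribute split: one child node is pure (entropy 0),
# the other is still a fair coin. Hafter is the weighted average of
# the children's entropies.
h_after = 0.5 * entropy([1.0]) + 0.5 * entropy([0.5, 0.5])  # 0.5

# Information gain is an ordinary subtraction of non-negative numbers.
gain = h_before - h_after  # 0.5
```

Both Hbefore and Hafter are non-negative, so the minus in gain = Hbefore - Hafter is exactly the subtraction it looks like; nobody forgot a sign.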