I'm going through Python Machine Learning and I'm at the Gini impurity section, where they define the Gini impurity as
$I_g(t) = \sum_{i=1}^c p(i|t) (1 - p(i|t))$
where $p(i|t)$ is the proportion of samples that belong to class $i$ (out of $c$ classes) at a particular node $t$. Fine, seems reasonable enough. But then they go on to simplify the formula into this:
$I_g(t) = 1 - \sum_{i=1}^c p(i|t)^2$
And I cannot, for the life of me, figure out how they arrived at this form. Am I making some incorrect assumptions as to how $p(i|t)$ works? Can I not treat $p(i|t)$ like any ordinary variable?
Note that $\sum_{i=1}^c p(i|t) = 1$, because the class proportions at a node add up to one; that is where the $1$ in the simplified formula comes from.
\begin{align}
I_g(t) &= \sum_{i=1}^c p(i|t) (1 - p(i|t)) \\
&= \sum_{i=1}^c \left(p(i|t) - p(i|t)^2\right) \\
&= \sum_{i=1}^c p(i|t) - \sum_{i=1}^c p(i|t)^2 \\
&= 1 - \sum_{i=1}^c p(i|t)^2
\end{align}
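If it helps to see the identity numerically, here is a minimal sketch in plain Python. The function names and the example proportions are made up for illustration (they are not from the book); the only assumption is that the proportions sum to 1, which is exactly what the derivation relies on.

```python
import math

def gini_sum_form(p):
    """Gini impurity as sum_i p_i * (1 - p_i)."""
    return sum(pi * (1 - pi) for pi in p)

def gini_one_minus_form(p):
    """Gini impurity as 1 - sum_i p_i^2 (valid only when sum(p) == 1)."""
    return 1 - sum(pi ** 2 for pi in p)

# Hypothetical class proportions at some node t; they must sum to 1.
proportions = [0.5, 0.3, 0.2]

a = gini_sum_form(proportions)
b = gini_one_minus_form(proportions)

# Compare with a tolerance, since floating-point sums are not exact.
print(math.isclose(a, b))
```

Both forms come out to roughly 0.62 here. If you pass values that do not sum to 1, the two functions disagree, which shows that the simplification depends on that constraint and not on any special property of $p(i|t)$ beyond it.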