I have a question regarding the GINI impurity that we use while building decision trees.
From the wiki page on gini impurity,
Gini impurity can be computed by summing the probability ${\displaystyle p_{i}}$ of an item with label $i$ being chosen times the probability $${\displaystyle \sum _{k\neq i}p_{k}=1-p_{i}}$$ of a mistake in categorizing that item
To compute Gini impurity for a set of items with ${\displaystyle J}$ classes, suppose ${\displaystyle i\in \{1,2,...,J\}} $, and let ${\displaystyle p_{i}}$ be the fraction of items labeled with class ${\displaystyle i} $ in the set.
$$I_{G}(p)=\sum _{i=1}^{J}p_{i}\sum_{k\neq i}p_{k}$$
$$I_{G}(p)=\sum_{i=1}^{J}p_{i}(1-p_{i})$$
$$I_{G}(p)=\sum _{i=1}^{J}(p_{i}-{p_{i}}^{2})$$
$$I_{G}(p)=\sum_{i=1}^{J}p_{i}-\sum _{i=1}^{J}{p_{i}}^{2}$$
$$I_{G}(p)=1-\sum _{i=1}^{J}{p_{i}}^{2}\tag{1}\label{1}$$
In the first step-
$$I_{G}(p)=\sum _{i=1}^{J}p_{i}\sum_{k\neq i}p_{k}$$,
Instead of combining the two summations, I want to proceed in an alternative way that I have included below, but I end up getting a weird result
Since $\sum_{i}^{J}p_{i}=1$,
$$I_{G}(p)=\sum _{i=1}^{J}p_{i}\sum_{k\neq i}p_{k}=(\sum _{i=1}^{J}p_{i})(\sum_{k\neq i}p_{k}) = \sum_{k\neq i}p_{k}$$
$$I_{G}(p)= \sum_{k\neq i}p_{k}= 1-p_{i}$$
Using the results from equation $1$, I have
$$1-p_{i} = 1-\sum _{i=1}^{J}{p_{i}}^{2}$$
$$p_{i} = \sum _{i=1}^{J}{p_{i}}^{2}$$
This is weird because according to this result all $p_i$'s should have same values which is obviously wrong.
Can you tell where am I doing this wrong?
I think you are making a mistake there: $I_G(p) = \sum_{i=1}^{J}p_i\sum_{k \neq i}p_k$; this is not equal to $(\sum_{i=1}^{J}p_i)(\sum_{k \neq i}p_k)$. Since $\sum_{k \neq i}p_k$ depends on $i$ and $p_i$ are not equal to each other for all $i$, you can't factorize the two summations like that. By doing this, you implicitly assume that all $p_i$s are equal actually, hence you get the related result. But this is not true for an arbitrary distribution $p_i$.