Calculating the GINI impurity

188 Views Asked by At

I have a question regarding the GINI impurity that we use while building decision trees.

From the wiki page on gini impurity,

Gini impurity can be computed by summing the probability ${\displaystyle p_{i}}$ of an item with label $i$ being chosen times the probability $${\displaystyle \sum _{k\neq i}p_{k}=1-p_{i}}$$ of a mistake in categorizing that item

To compute Gini impurity for a set of items with ${\displaystyle J}$ classes, suppose ${\displaystyle i\in \{1,2,...,J\}} $, and let ${\displaystyle p_{i}}$ be the fraction of items labeled with class ${\displaystyle i} $ in the set.

$$I_{G}(p)=\sum _{i=1}^{J}p_{i}\sum_{k\neq i}p_{k}$$

$$I_{G}(p)=\sum_{i=1}^{J}p_{i}(1-p_{i})$$

$$I_{G}(p)=\sum _{i=1}^{J}(p_{i}-{p_{i}}^{2})$$

$$I_{G}(p)=\sum_{i=1}^{J}p_{i}-\sum _{i=1}^{J}{p_{i}}^{2}$$

$$I_{G}(p)=1-\sum _{i=1}^{J}{p_{i}}^{2}\tag{1}\label{1}$$

In the first step-

$$I_{G}(p)=\sum _{i=1}^{J}p_{i}\sum_{k\neq i}p_{k}$$,

Instead of combining the two summations, I want to proceed in an alternative way that I have included below, but I end up getting a weird result

Since $\sum_{i}^{J}p_{i}=1$,

$$I_{G}(p)=\sum _{i=1}^{J}p_{i}\sum_{k\neq i}p_{k}=(\sum _{i=1}^{J}p_{i})(\sum_{k\neq i}p_{k}) = \sum_{k\neq i}p_{k}$$

$$I_{G}(p)= \sum_{k\neq i}p_{k}= 1-p_{i}$$

Using the results from equation $1$, I have

$$1-p_{i} = 1-\sum _{i=1}^{J}{p_{i}}^{2}$$

$$p_{i} = \sum _{i=1}^{J}{p_{i}}^{2}$$

This is weird because according to this result all $p_i$'s should have same values which is obviously wrong.

Can you tell where am I doing this wrong?

1

There are 1 best solutions below

3
On BEST ANSWER

I think you are making a mistake there: $I_G(p) = \sum_{i=1}^{J}p_i\sum_{k \neq i}p_k$; this is not equal to $(\sum_{i=1}^{J}p_i)(\sum_{k \neq i}p_k)$. Since $\sum_{k \neq i}p_k$ depends on $i$ and $p_i$ are not equal to each other for all $i$, you can't factorize the two summations like that. By doing this, you implicitly assume that all $p_i$s are equal actually, hence you get the related result. But this is not true for an arbitrary distribution $p_i$.