How to find the threshold to split the data based on a feature in decision trees


I was studying decision trees from a book about machine learning, and it says that the rule applied to partition the data in a node of the tree is the one that maximizes the information gain, defined as \begin{equation} IG(D_p, f) = I(D_p) -\sum_{j=1}^c \frac{N_j}{N_p} I(D_j) \end{equation} where $D_p$ is the dataset of the node, $f$ is the feature on which the rule is applied, $I$ is an impurity measure such as the Gini impurity or the entropy, $D_j$ is the $j$-th subset of $D_p$ generated by the rule, $c$ is the number of partitions made by the rule, and $N_j$ and $N_p$ are, respectively, the number of elements in the $j$-th partition and in the original node's dataset $D_p$.
That said, I don't really understand how this quantity is maximized in practice. For instance, assuming the input data is numeric, in a simple binary decision tree the decision rule must be something like $$ x_f \ge t \qquad \qquad t \in \mathbb{R} $$ So, in this simplified scenario, how does one find the $t$ that maximizes the information gain for the feature $f$?
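For concreteness, this is how I read the formula in code, using the Gini impurity as $I$ (the function names are mine, not from the book):

```python
import numpy as np

def gini(y):
    """Gini impurity I(D) = 1 - sum_k p_k^2 over the class proportions of y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(y_parent, partitions):
    """IG(D_p, f) = I(D_p) - sum_j (N_j / N_p) * I(D_j),
    where partitions is the list of label arrays for D_1, ..., D_c."""
    n_p = len(y_parent)
    return gini(y_parent) - sum(len(y_j) / n_p * gini(y_j) for y_j in partitions)
```

For example, splitting the labels `[0, 0, 1, 1]` into the pure subsets `[0, 0]` and `[1, 1]` gives the maximum possible gain of $0.5$ under the Gini impurity.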


1 Answer


Nevermind, I think I figured it out.
Suppose that $x^i \in D_p$ is the i-th sample of the parent node, and that $D_1$ and $D_2$ are the two datasets created by the binary split given by the rule applied in the current node.
Any $t$ in an interval that doesn't move any sample $x^i$ from $D_1$ to $D_2$ has no effect on the information gain $IG$, since in the end the two subsets of the data created are the same, and so are their sizes and their values of $I$.
This means that it only makes sense to check the values of $t$ that CAN move at least one sample $x^i$ from $D_1$ to $D_2$ or vice versa. Assuming the form of the rule specified in the question ($x_f \ge t$), it is clear that such values of $t$ are simply the observed feature values $x^1_f, x^2_f, \ldots, x^n_f$.
Furthermore, since the method described is fundamentally a greedy algorithm, once it has been established that a certain value of $t$ moves a sample $x^i$ with label $y_i$ from $D_1$ to $D_2$, there is no need to check any other value of $t$ whose only effect is to move a single sample $x^j$ with $y_j = y_i$ from $D_1$ to $D_2$: it would have exactly the same effect on the information gain as the case of $x^i$, which has already been explored.
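To make this concrete, here is a minimal sketch of the candidate-threshold scan described above, assuming the Gini impurity for $I$ and the rule $x_f \ge t$ (the function names are mine):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array: 1 - sum_k p_k^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y):
    """Scan the observed values of feature x as candidate thresholds
    for the rule x >= t, returning the t with maximum information gain."""
    parent = gini(y)
    n = len(y)
    best_t, best_ig = None, -np.inf
    for t in np.unique(x):
        mask = x >= t
        # skip degenerate splits where one side is empty
        if mask.all() or not mask.any():
            continue
        ig = (parent
              - (mask.sum() / n) * gini(y[mask])
              - ((~mask).sum() / n) * gini(y[~mask]))
        if ig > best_ig:
            best_t, best_ig = t, ig
    return best_t, best_ig
```

For well-separated data such as `x = [1, 2, 3, 10, 11, 12]` with labels `[0, 0, 0, 1, 1, 1]`, the scan picks $t = 10$, which splits the node into two pure children and attains the full parent impurity of $0.5$ as the gain.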