Why do we want a decision tree to be shallow? Why do we split to maximize information gain?


When constructing a decision tree, we use a measure such as Gini impurity or information gain to decide which split is best. IIUC, we want the tree to be as shallow as possible.

But why do we care? What difference would it make if we constructed a larger decision tree by splitting on some non-essential parameter first? Is it because we ideally don't want to rely on features that do not seem to carry much information, since there is a higher chance we would be fitting the noise, thus limiting our ability to generalize?
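To make the intuition concrete, here is a minimal sketch (plain Python, with made-up labels) comparing the information gain of a split on an informative feature against a split on a noisy one:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left, right):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

labels = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']

# An informative feature separates the classes cleanly ...
gain_good = information_gain(labels, ['a', 'a', 'a', 'a'], ['b', 'b', 'b', 'b'])
# ... while a noisy feature leaves both children as mixed as the parent.
gain_noise = information_gain(labels, ['a', 'a', 'b', 'b'], ['a', 'a', 'b', 'b'])

print(gain_good)   # 1.0 bit: the split resolves all class uncertainty
print(gain_noise)  # 0.0 bits: the split tells us nothing about the class
```

A greedy tree builder would pick the first split; the second would grow the tree without reducing uncertainty at all, which is exactly the "fitting noise" worry above.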

Am I on the right track here, and has the relationship between relative signal strength and the likelihood of it being noise been formalized?


There is 1 answer below.


You want the decision tree that is the "simplest" (so as to avoid overfitting the data), and that means the tree with the fewest nodes. A binary splitting rule that sends half of the unassigned patterns to one branch and half to the other asks a maximum-information question (a balanced yes/no question carries one full bit of entropy), and is thus optimal at that level of the tree.
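The "one full bit" claim can be checked directly. A sketch (plain Python): the entropy of a binary question that sends a fraction `p` of the patterns down one branch is H(p) = -p·log2(p) - (1-p)·log2(1-p), which peaks at p = 0.5:

```python
from math import log2

def split_entropy(p):
    """Entropy (in bits) of a yes/no question answered 'yes' with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a question whose answer is certain carries no information
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p = {p:.1f}  ->  {split_entropy(p):.3f} bits")
```

The 50/50 split carries the most information per question, so it shrinks the set of remaining possibilities fastest, which is what keeps the tree shallow.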