When is my lightgbm going to find cut points in random variables that reduce entropy more than a naturally correlated variable with the target?


In machine learning we sometimes build models with hundreds of variables/features without knowing (at least at first) whether they are related to the target. Usually we find that some of them are and others aren't. Some even turn out to have a true relation that we couldn't have thought of at the beginning.

Once we build a first model, we sometimes think of a new variable that we know has a natural relation with the target and that we hadn't considered at first. Often this new variable is sparse, though, meaning it is constant or null for most of the data. The problem is that for the model to use the information in the new variable, some node must find a cut point on that variable that reduces the loss function more than every cut point on every other variable. A sparse variable usually doesn't reduce the loss function much, because most of the data ends up on one side of the split and only a small fraction goes to the other.

Moreover, with that many variables, we statistically find cut points on variables unrelated to the target that reduce the loss function more for the data points in a given node. Not because there is a true relation, but purely by chance. This ends in overfitting, and also in not being able to use the predictive capacity of our new variable.
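The chance-split effect can be sketched with plain numpy (the data-generating setup below, including the effect size, sparsity rate, and feature count, is an illustrative assumption): for each feature, compute the best single-split reduction in squared error, then compare an informative-but-sparse feature against many features unrelated to the target.

```python
import numpy as np

def best_split_gain(x, y):
    """Largest reduction in sum of squared errors over all threshold splits on x."""
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]
    n = len(y)
    csum, csq = np.cumsum(ys), np.cumsum(ys ** 2)
    total_sse = csq[-1] - csum[-1] ** 2 / n
    best = 0.0
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # cannot place a threshold between identical feature values
        left = csq[i - 1] - csum[i - 1] ** 2 / i
        right = (csq[-1] - csq[i - 1]) - (csum[-1] - csum[i - 1]) ** 2 / (n - i)
        best = max(best, total_sse - left - right)
    return best

rng = np.random.default_rng(0)
n = 500
sparse = (rng.random(n) < 0.02).astype(float)   # informative, but nonzero for only ~2% of rows
y = 1.5 * sparse + rng.normal(size=n)           # the true signal lives in `sparse`
noise = rng.normal(size=(n, 200))               # 200 features unrelated to y

gain_sparse = best_split_gain(sparse, y)
gain_noise = max(best_split_gain(noise[:, j], y) for j in range(200))
print(f"gain of sparse feature:          {gain_sparse:.1f}")
print(f"best gain among noise features:  {gain_noise:.1f}")
```

Depending on the seed, effect size, and number of noise features, the best spurious gain can approach or even exceed the sparse feature's gain, which is exactly the situation where the tree passes over the informative variable.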

In these circumstances, what can we do to extract the value of our new variable?

1 Answer
The problems you're pointing out are:

  1. The new feature only applies to a small subset of the data, so the tree never splits on this feature
  2. Noise causes spurious splits on features that are unrelated to the target

1 can be addressed by increasing the complexity of your model, in your case possibly by increasing the number of trees.
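As a sketch, raising capacity in LightGBM might look like the following (the parameter names are LightGBM's scikit-learn-style names; the specific values are illustrative assumptions, not tuned recommendations):

```python
# Illustrative LightGBM parameters; the values are assumptions, not recommendations.
# More trees give the sparse feature more chances to be chosen in at least one of them.
params = {
    "objective": "regression",
    "n_estimators": 2000,     # larger ensemble: more opportunities to split on the sparse feature
    "learning_rate": 0.02,    # lower rate to compensate for the extra trees
    "min_child_samples": 5,   # allow small leaves so the rare nonzero rows can form their own node
}
```

These would be passed to `lightgbm.LGBMRegressor(**params)`; lowering `min_child_samples` in particular matters for a sparse feature, since the informative side of its split may contain very few rows.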

2 can be addressed by collecting more data. The issue you pointed out is that your estimate of the loss reduction has too high a variance. You can reduce this variance by collecting more data.
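A small numpy sketch of that point (the sizes, seed, and repetition count are arbitrary assumptions): the best apparent gain from splitting on a pure-noise feature, taken as a share of the total squared error, shrinks as the dataset grows.

```python
import numpy as np

def spurious_gain_fraction(n, rng):
    """Best single-split SSE reduction on a pure-noise feature, as a share of total SSE."""
    x = rng.normal(size=n)
    y = rng.normal(size=n)           # target is unrelated to x by construction
    ys = y[np.argsort(x)]
    csum, csq = np.cumsum(ys), np.cumsum(ys ** 2)
    total = csq[-1] - csum[-1] ** 2 / n
    i = np.arange(1, n)              # left child gets the first i sorted points
    left = csq[:-1] - csum[:-1] ** 2 / i
    right = (csq[-1] - csq[:-1]) - (csum[-1] - csum[:-1]) ** 2 / (n - i)
    return (total - left - right).max() / total

rng = np.random.default_rng(1)
small = np.mean([spurious_gain_fraction(200, rng) for _ in range(20)])
large = np.mean([spurious_gain_fraction(5000, rng) for _ in range(20)])
print(f"spurious gain share, n=200:  {small:.4f}")
print(f"spurious gain share, n=5000: {large:.4f}")
```

With more data, the apparent loss reduction that a noise feature can offer becomes a much smaller fraction of the total, so genuinely informative splits are less likely to be outcompeted by chance.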

For a fixed dataset size, the remedy for 1 will make the problems in 2 worse, so there is a tradeoff between the two. You can't get much value out of features that only apply to a small portion of the data if your dataset is small.