Looking at the Kaggle Titanic dataset, there exist two related predictor columns:
- Parch describes the number of that passenger's parents and children that have boarded
- SibSp describes the number of that passenger's siblings and spouses that have boarded
I've seen multiple examples in the shared code where people create a new predictor column named Family that is the sum of Parch and SibSp.
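For concreteness, here is a minimal sketch of that feature construction (a few synthetic rows standing in for the actual Titanic CSV), which also shows that the three columns together are perfectly collinear:

```python
import numpy as np
import pandas as pd

# Synthetic rows standing in for the Kaggle Titanic data.
df = pd.DataFrame({
    "SibSp": [1, 0, 3, 0],
    "Parch": [0, 2, 1, 0],
})

# The new predictor is just the sum of the two existing columns.
df["Family"] = df["SibSp"] + df["Parch"]

# Family is an exact linear combination of SibSp and Parch, so the
# design matrix [SibSp, Parch, Family] is rank-deficient: rank 2, not 3.
rank = np.linalg.matrix_rank(df[["SibSp", "Parch", "Family"]].to_numpy())
print(rank)  # 2 -> perfect multicollinearity
```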
Now, by construction, Family = Parch + SibSp, so Family is perfectly predicted by the other two columns. That makes the three columns multicollinear (multicollinearity wiki), which I understand to be problematic.
So why create the Family column? If it really is multicollinear, I don't see why so many people (who seem to understand ML better than I do) are doing this.
EDIT: I've also discovered that people sometimes combine variables they expect to be multicollinear into a single new variable. E.g., in the problem described, if SibSp is suspected to be a predictor of Parch, it would make sense to create Family, though the video [timestamp 14:05] offers no explanation of why.
Multicollinearity is mainly an issue when fitting linear regression models. However, in machine learning applications it is rare to fit just a vanilla OLS model. Moreover, adding a regularization term is very common, and even for an OLS model this addresses many of the problems of multicollinearity, since it yields a problem that has a unique solution and a controlled condition number. The issues of parameter identifiability and significance estimates mentioned in the link are essentially irrelevant to machine learning, where the ultimate objective is predictive accuracy.
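A minimal numerical sketch of this point, on synthetic data with an arbitrary ridge penalty alpha = 1.0 (the variable names and target are illustrative, not from the original post): with perfectly collinear columns the OLS normal equations are singular, but the regularized system has a unique, well-conditioned solution.

```python
import numpy as np

rng = np.random.default_rng(0)
sibsp = rng.integers(0, 4, size=100)
parch = rng.integers(0, 4, size=100)
family = sibsp + parch                      # exact linear combination
X = np.column_stack([sibsp, parch, family]).astype(float)
y = 0.5 * family + rng.normal(size=100)     # toy target

# Plain OLS: X'X is singular (rank 2 in 3 dimensions), so the
# normal equations have no unique solution.
XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))           # 2

# Ridge: X'X + alpha*I is full rank, so the solution is unique
# and the condition number is finite and controlled by alpha.
alpha = 1.0
w = np.linalg.solve(XtX + alpha * np.eye(3), X.T @ y)
print(np.linalg.cond(XtX + alpha * np.eye(3)))
```

The individual coefficients in `w` are still not interpretable (the data cannot distinguish credit between the collinear columns), but the predictions `X @ w` are well defined, which is what matters for the Kaggle objective.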