Missing Values in a Dataset for Multiple Regression


I've seen countless projects and tutorials on Kaggle where authors replace the missing values in a column with the column's calculated mean. However, I am now reading Multiple Regression: A Primer by Paul D. Allison, in which the author writes:

[...] Even worse than pairwise deletion is replacement with means. In this method, the missing values are replaced with the mean value of the variable for those individuals without the missing data.

So I am curious: is replacement with means a legitimate way to deal with missing values, or are there more common and preferable ways to do it?

There is 1 best solution below

Replacement with the mean is often inappropriate because the person's true value may be very far from the mean of that covariate. Filling in the mean also artificially shrinks the variance of the variable and weakens its apparent relationship with the other variables. A better approach is to use the person's other covariates to infer the missing value of their covariate. There are whole books and courses devoted to missing data, but if you want somewhere to start, I recommend looking at the Expectation-Maximization (EM) algorithm, which alternates between accounting for the missing data in expectation under the current parameter estimates and re-estimating the parameters to maximize the observed-data log likelihood.
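To make the contrast concrete, here is a minimal sketch on synthetic data. It is not EM; it uses plain single regression imputation as a simpler stand-in for "use the other covariates to infer the missing value". The data-generating process, the 30% missingness rate, and all variable names are assumptions made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): x2 depends linearly on x1.
n = 1000
x1 = rng.normal(size=n)
x2 = 2.0 * x1 + rng.normal(scale=0.5, size=n)

# Hide roughly 30% of x2 completely at random.
missing = rng.random(n) < 0.3
x2_obs = np.where(missing, np.nan, x2)

# Option 1: mean imputation. Every gap gets the same value.
mean_filled = np.where(missing, np.nanmean(x2_obs), x2_obs)

# Option 2: regression imputation. Fit x2 ~ x1 on the complete cases,
# then predict the missing x2 from each person's own x1.
slope, intercept = np.polyfit(x1[~missing], x2_obs[~missing], deg=1)
reg_filled = np.where(missing, intercept + slope * x1, x2_obs)

def rmse(filled):
    """Error of the imputed values against the (normally unknown) truth."""
    return float(np.sqrt(np.mean((filled[missing] - x2[missing]) ** 2)))

rmse_mean = rmse(mean_filled)
rmse_reg = rmse(reg_filled)
var_true = float(np.var(x2))
var_mean_filled = float(np.var(mean_filled))

print("RMSE, mean imputation:      ", rmse_mean)
print("RMSE, regression imputation:", rmse_reg)
print("variance of true x2:        ", var_true)
print("variance after mean filling:", var_mean_filled)
```

In this setup the mean-filled column should show a clearly smaller variance than the true column, and the regression-based fill should land much closer to the hidden values. Note that single regression imputation still understates uncertainty, because the imputed points sit exactly on the fitted line; that is part of why approaches such as EM or multiple imputation are usually recommended for real analyses.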