Missing Values in a Dataset for Multiple Regression

70 Views Asked by Bumbble Comm At 31 Mar 2026 - 9:31

I've seen countless projects and tutorials on Kaggle where authors replace missing values in a column with the column's calculated mean. However, I am now reading Multiple Regression: A Primer by Paul D. Allison, in which the author writes this:

[...] Even worse than pairwise deletion is replacement with means. In this method, the missing values are replaced with the mean value of the variable for those individuals without the missing data.

So I am curious: is replacement with means a legitimate way to deal with missing values, or are there more common and preferable ways to do it?


There is 1 best solution below

Bumbble Comm On 22 Jun 2022 - 6:49

Replacement with the mean might not be appropriate because the person's true value may be very far from the mean of that covariate. A better approach would be to use the person's other covariates to infer the missing value of their covariate. There are whole books and courses devoted to missing data, but if you want somewhere to start, I recommend looking at the Expectation-Maximization (EM) algorithm, which alternates between estimating the missing data under the current model parameters and refitting the model to maximize the observed-data log-likelihood.
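
To make the difference concrete, here is a minimal sketch on simulated data, not a recommended pipeline. It compares mean imputation with simple regression imputation, where the missing covariate is predicted from another covariate. Note that this is not the EM algorithm itself, just the simplest version of "use the other covariates". All names (x1, x2, the helper functions) and the data-generating numbers are made up for illustration, and values are assumed to be missing completely at random.

```python
import random

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Sample variance (n - 1 denominator)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def ols(xs, ys):
    """Return (intercept, slope) of the least-squares line ys ~ xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

# Simulated data (an assumption for illustration): x2 depends linearly on x1.
rng = random.Random(0)
n = 500
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * u + rng.gauss(0, 0.5) for u in x1]

# Delete roughly 30% of x2 completely at random; None marks a missing value.
x2_missing = [None if rng.random() < 0.3 else v for v in x2]

# Complete cases: rows where x2 was observed.
obs_x1 = [u for u, v in zip(x1, x2_missing) if v is not None]
obs_x2 = [v for v in x2_missing if v is not None]

# Mean imputation: every missing x2 gets the observed mean.
x2_mean_filled = [mean(obs_x2) if v is None else v for v in x2_missing]

# Regression imputation: predict missing x2 from x1,
# using a line fitted on the complete cases only.
intercept_cc, slope_cc = ols(obs_x1, obs_x2)
x2_reg_filled = [
    intercept_cc + slope_cc * u if v is None else v
    for u, v in zip(x1, x2_missing)
]

print("variance of x2, observed only :", variance(obs_x2))
print("variance of x2, mean-filled   :", variance(x2_mean_filled))
print("slope x2~x1, complete cases   :", slope_cc)
print("slope x2~x1, mean-filled      :", ols(x1, x2_mean_filled)[1])
print("slope x2~x1, regression-filled:", ols(x1, x2_reg_filled)[1])
```

Mean-filled data has a smaller variance than the observed values and a slope pulled toward zero, because every imputed point sits at the mean regardless of x1. Regression imputation keeps the complete-case slope, but it still places imputed points exactly on the fitted line and so understates the uncertainty, which is why methods such as EM or multiple imputation are usually preferred in practice.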
