How can I impute missing values in data such that the end result is close to normal?


I have $N$ (about 30) data values, of which $n$ (about 4) are missing. Domain knowledge tells me that the overall distribution of my data should be roughly normal. Because of the missing values, the distribution is currently not normal.

How should I go about imputing the missing $n$ values such that my final data is as close to normal as possible? Or, is this an altogether foolish way of dealing with missing data?



Comment continued:

The scenario you propose, that normality is spoiled by a few missing observations (deleted without regard to data values), seems unlikely. So I will check it out (using R statistical software).

I sample 30 observations from a normal population. I test them for 'normality' (using a Shapiro-Wilk test). In advance, I decide to delete observations 11 through 15 (in order of collection). I test the remaining ones for normality.

set.seed(331)            # for reproducibility
x = rnorm(30, 100, 15)   # 30 draws from Normal(mean 100, SD 15)
shapiro.test(x)

        Shapiro-Wilk normality test

data:  x
W = 0.95582, p-value = 0.2414

y = x[c(1:10, 16:30)]    # delete observations 11 through 15
shapiro.test(y)

       Shapiro-Wilk normality test

data:  y
W = 0.94638, p-value = 0.2075

Data are consistent with normal before and after deletion. (Both P-values are substantially greater than 5%.)

par(mfrow=c(1,2))        # two plots side by side
qqnorm(x, datax = T, main="Original Data")
qqnorm(y, datax = T, main="After Deletion of 5 Observations")
par(mfrow=c(1,1))        # reset plotting layout

Another way to check for normality is to see if a normal probability plot is (roughly) linear. Plots for both datasets seem consistent with normality by this criterion also. [Slight 'wobbles' are OK, especially with such small sample sizes.]

[Figure: normal probability plots of the original data and the data after deletion; both are roughly linear.]


Note: What you say can happen if I delete the largest five observations. After a dozen tries I encountered a dataset that barely passed the Shapiro-Wilk test (P-value about 0.06), but failed after deletion of the five largest observations (P-value about 0.005). Also, the normal probability plot of the data after deletion seems distinctly non-linear.

[Figure: normal probability plot after deleting the five largest observations; distinctly non-linear.]
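The "delete the five largest" experiment just described can be sketched as follows. The seed here is arbitrary (not the one used above), so the exact P-values will differ from the ones I reported:

```r
# Sketch of the experiment above: sample until a dataset passes the
# Shapiro-Wilk test at the 5% level, then delete its five largest
# observations and re-test. Seed chosen arbitrarily for reproducibility.
set.seed(2024)
repeat {
  x = rnorm(30, 100, 15)
  if (shapiro.test(x)$p.value > 0.05) break  # keep a 'normal-looking' sample
}
y = sort(x)[1:25]          # remove the five largest observations
shapiro.test(x)$p.value    # above 0.05 by construction
shapiro.test(y)$p.value    # often much smaller: truncation skews the data
```

Deleting the upper tail leaves a left-truncated sample, which is exactly the kind of systematic (non-random) missingness that breaks normality.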


First of all, inventing data points within real studies is fraud.

Nevertheless, a possible way is as follows:

You said your data come from a population you assume to be normally distributed, $N(\mu,\sigma^2)$. That claim should then survive a $\chi^2$ goodness-of-fit test.

  • Split your data into classes for a $\chi^2$ goodness-of-fit test of normality.
  • Calculate the $\chi^2$ statistic for your data without any added items.
  • Check whether the $\chi^2$ value supports your claim. If not, the assumption of normality may simply be inappropriate.
  • Spot classes where the observed count lies below the count expected under the assumed distribution.
  • If there are any, fill "random" data points into those classes; this may even lower the $\chi^2$ statistic.
  • If you cannot find such classes, add data items to the classes whose $\chi^2$ contribution is least sensitive to an extra observation; these are the classes with the highest expected counts.
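The steps above can be sketched in R as follows. The placeholder data, the number of classes `k`, and the equal-probability class boundaries are all illustrative choices of mine, not part of the procedure as stated (and, per the warning above, this is a diagnostic sketch, not an endorsement of inventing observations):

```r
# Sketch of the chi-squared approach: bin the observed data, compare observed
# counts with counts expected under the fitted normal, and look for
# under-filled classes as candidate destinations for imputed values.
set.seed(101)
x  = rnorm(26, 100, 15)          # placeholder data standing in for yours
mu = mean(x); s = sd(x)          # parameters estimated from the data
k  = 5                           # number of classes (a choice)
breaks   = qnorm(seq(0, 1, length.out = k + 1), mu, s)  # equal-probability bins
observed = table(cut(x, breaks))
expected = rep(length(x) / k, k) # equal-probability bins: equal expected counts
chisq    = sum((observed - expected)^2 / expected)
# df = k - 1 - 2, since two parameters were estimated from the data
p = pchisq(chisq, df = k - 3, lower.tail = FALSE)
which(observed < expected)       # under-filled classes, per the 4th step above
```

Note the degrees-of-freedom correction: estimating $\mu$ and $\sigma$ from the data costs two degrees of freedom, so the reference distribution is $\chi^2$ with $k-3$ degrees of freedom rather than $k-1$.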