Determining Normality With Large Samples


I am reading in some files and must determine whether the data sets are normally distributed (within a certain degree of certainty, of course, since normality can only be disproven, never proven). My data sets are quite large; most have over 15,000 samples. What is a good test to run? I would rather use the whole data set than sample it at random. Also, if possible, do you know how to do this in MATLAB? I can type out a method if need be, but it would be nice to use a built-in function. Thanks.

Best Answer

For the reason in @David's comment, most tests of normality don't accommodate samples larger than a few thousand. Real data tend to have small deviations from normality that are of no consequence for the validity of statistical procedures, so there is no point in detecting anomalies that would become evident only in very large samples.
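To illustrate the point, here is a sketch in Python (assuming NumPy and SciPy are available) using the D'Agostino-Pearson test, which has no built-in sample-size cap. A t-distribution with 25 degrees of freedom is nearly normal (excess kurtosis about 0.29, a deviation of no practical consequence), yet a sample of 100,000 flags it decisively, while a modest sample usually does not:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2019)  # for reproducibility

# t-distributed data with 25 df: nearly normal, with only a
# slightly heavy tail (excess kurtosis ~ 0.29).
big = rng.standard_t(df=25, size=100_000)
small = rng.standard_t(df=25, size=300)

# D'Agostino-Pearson omnibus test (no sample-size limit in SciPy).
stat_big, p_big = stats.normaltest(big)
stat_small, p_small = stats.normaltest(small)

print(f"n = 100000: P-value = {p_big:.2e}")   # huge n: tiny deviation detected
print(f"n =    300: P-value = {p_small:.3f}") # modest n: usually not detected
```

This is the large-sample dilemma in miniature: with enough data, the test rejects for deviations far too small to matter in practice.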

In R, here is how to sample $n = 15000$ observations from $\mathsf{Norm}(\mu = 100, \sigma = 15).$ For data sampled in R, one would not expect detectable differences from normality, up to the accuracy of double precision representation of the data.

set.seed(2019)              # for reproducibility
x = rnorm(15000, 100, 15)   # n = 15000 from Norm(mu = 100, sigma = 15)

A summary of the data shows the sample mean and median about equal, and the first and third quartiles about equidistant from the median, as one would expect from normal data.

summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  37.37   89.61   99.72   99.67  109.68  160.71 

Also a histogram of the data on a density scale (so that the areas of all histogram bars sum to unity) is well-matched by a normal density curve with $\mu$ approximated by sample mean $\bar X$ and $\sigma$ approximated by sample standard deviation $S.$

hist(x, prob=T, br=30, col="skyblue2")                    # density scale: bar areas sum to 1
mu.est = mean(x);  sg.est = sd(x)                         # parameter estimates
curve(dnorm(x, mu.est, sg.est), add=T, lwd=2, col="red")  # fitted normal density

[Figure: histogram of x on a density scale with fitted normal density curve]

The Shapiro-Wilk test of normality in R handles at most 5000 observations, so we test three non-overlapping blocks of 5000. After the first test, we extract only the p.val component to show the P-value of each test. If the P-value exceeds 0.05, we say the data are consistent with sampling from a normal population, but that is no proof that the data are perfectly normal.

shapiro.test(x[1:5000])

        Shapiro-Wilk normality test

data:  x[1:5000]
W = 0.9996, p-value = 0.4131

shapiro.test(x[5001:10000])$p.val
[1] 0.7028041
shapiro.test(x[10001:15000])$p.val
[1] 0.5594307

We can also test random subsets of size 5000, sampled without replacement:

shapiro.test(sample(x,5000))$p.val
[1] 0.9059113
shapiro.test(sample(x,5000))$p.val
[1] 0.7519748

Furthermore, a good overview of the normality of the entire sample of 15000 can be obtained from a normal probability plot.

qqnorm(x);  qqline(x, col="red", lwd=2)

[Figure: normal Q-Q plot of x with reference line]

The excellent fit to a straight line between theoretical quantiles $\pm 3$ indicates an excellent fit of the data to a normal distribution. [There are not enough data points in the tails to overcome the randomness of sampling, so don't expect more than an approximate fit beyond $\pm 3$.]

Not all of these tests and descriptions of the data are necessary to check that there is no important departure from normality. You might pick the ones you understand best from a theoretical point of view, or the ones with the greatest intuitive appeal. But do not expect data from real-life applications to fit normality as closely as data generated by trustworthy statistical software.
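Since the question also asked about other environments: MATLAB's Statistics and Machine Learning Toolbox provides built-in normality tests such as lillietest, adtest, and jbtest. The same block-of-5000 workflow can be sketched in Python with SciPy (function names and parameters below are SciPy's, not part of the original answer):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2019)  # for reproducibility
x = rng.normal(loc=100, scale=15, size=15_000)

# Shapiro-Wilk on three non-overlapping blocks of 5000,
# mirroring the R approach above.
pvals = [stats.shapiro(block).pvalue for block in np.split(x, 3)]
print("block P-values:", np.round(pvals, 3))

# Normal probability plot fit: for normal data the least-squares line
# has slope ~ sigma and intercept ~ mu, with correlation r near 1.
(osm, osr), (slope, intercept, r) = stats.probplot(x, dist="norm")
print(f"Q-Q fit: slope = {slope:.1f}, intercept = {intercept:.1f}, r = {r:.4f}")
```

As with the R session, P-values above 0.05 and a near-unit correlation in the Q-Q fit indicate data consistent with normality, not proof of it.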