How to tell when a data series is a normal distribution


I have a list of data, and I need to know if the values are normally distributed. 4 of the 20 values on the list lie more than 3 standard deviations from the mean. Are the data normally distributed, and why or why not?


There are 1 best solutions below


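It is not unusual for normal samples to show boxplot outliers. However, it would be unusual to have even two observations in 20 lie outside the interval $\bar X \pm 3S,$ where $\bar X$ and $S$ are the sample mean and SD, respectively. For an exactly normal population, only about 0.27% of values lie more than three standard deviations from the population mean.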
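As a rough sketch of the three-SD check, the R code below counts how many values fall outside $\bar X \pm 3S.$ The vector `x` is hypothetical data generated only for illustration (the seed, mean and SD are my assumptions); substitute your own 20 values.

```r
# Hypothetical data: 20 values used only to illustrate the check
set.seed(1)                          # arbitrary seed for reproducibility
x <- rnorm(20, mean = 100, sd = 15)  # assumed parameters, not from the question

xbar <- mean(x)                      # sample mean
s    <- sd(x)                        # sample SD (divisor n - 1)

# TRUE where a value lies outside xbar +/- 3 s
outside <- abs(x - xbar) > 3 * s
sum(outside)                         # number of values beyond three sample SDs

# Proportion beyond three population SDs for an exactly normal population
2 * pnorm(-3)
```

One caution: if $\bar X$ and $S$ are computed from the same 20 values, then $19S^2 = \sum (X_i - \bar X)^2,$ so at most two values can lie more than $3S$ from $\bar X.$ A count of four would suggest the SD was obtained some other way.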

A commonly used graphical procedure is to make a 'normal probability plot', also called a 'Q-Q plot' (or 'quantile-quantile' plot). Roughly speaking, the points on a normal probability plot should lie in a straight line (with some recognition that it is not unusual for a few of the smaller or larger observations to stray from the line).
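A minimal sketch of such a plot in R uses `qqnorm` and `qqline` from base graphics. The sample `x` is hypothetical (seed and parameters are my assumptions). The correlation at the end is only an informal summary of how straight the plot is, not a formal test.

```r
set.seed(2)                          # arbitrary seed
x <- rnorm(20, mean = 100, sd = 15)  # hypothetical sample; replace with your data

# Plot sample quantiles against theoretical normal quantiles
qq <- qqnorm(x, main = "Normal probability plot")
qqline(x)                            # reference line through the quartiles

# Informal linearity summary: values near 1 indicate points close to a line
cor(qq$x, qq$y)
```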

There are many formal statistical tests of normality. One of the better ones is the Shapiro-Wilk test. With a sample size as small as $n = 20$ it is very difficult to distinguish normal data from data that are only approximately normal. (It is generally futile to try to judge normality by looking at a histogram of a sample of 20.)
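To give a feel for how hard this is at $n = 20,$ here is a small simulation sketch. The seed, the number of replications and the helper name `reject.rate` are my own choices, not part of any standard procedure. It estimates how often the Shapiro-Wilk test rejects at the 5% level for samples of size 20 from normal, exponential and uniform populations.

```r
set.seed(4)      # arbitrary seed
n    <- 20       # sample size discussed in the text
reps <- 2000     # number of simulated samples per population (assumed)

# Proportion of simulated samples for which normality is rejected at the 5% level.
# rgen is any function that returns n random values, e.g. rnorm.
reject.rate <- function(rgen) {
  mean(replicate(reps, shapiro.test(rgen(n))$p.value < 0.05))
}

rates <- c(normal      = reject.rate(rnorm),
           exponential = reject.rate(rexp),
           uniform     = reject.rate(runif))
rates
```

The rate for the normal population should be near the nominal 5%. The other two rates estimate the power of the test against those alternatives at this sample size.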

I will illustrate these methods using four random samples of size 20, the first two from normal populations, the third from an exponential population and the fourth from a uniform population.
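Samples like these could be generated roughly as follows. The seed and all distribution parameters here are assumptions made for illustration, so this sketch will not reproduce the exact values behind the results shown later.

```r
set.seed(2024)                           # arbitrary seed; the original seed is not given
n <- 20

norm.1 <- rnorm(n, mean = 100, sd = 15)  # assumed parameters
norm.2 <- rnorm(n, mean = 100, sd = 15)  # assumed parameters
expo.3 <- rexp(n, rate = 1/100)          # assumed rate (population mean 100)
unif.4 <- runif(n, min = 50, max = 150)  # assumed range

# Collect the samples in a named list for later use
samples <- list(norm.1 = norm.1, norm.2 = norm.2,
                expo.3 = expo.3, unif.4 = unif.4)
sapply(samples, length)                  # each sample has 20 values
```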

Boxplots. The second normal sample happens to show two boxplot outliers, which is not unusual. However, this sample has $\bar X \pm 3S$ approximately equal to $(52,\,151),$ which includes the outliers at about $65$ and about $139,$ so your three-SD rule is not violated.

Exponential samples usually show several high outliers, so it is somewhat unusual to see only one here. Even so, the boxplot of the exponential sample shows marked skewness toward high values.

[Figure: boxplots of the four samples]
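The comparison between boxplot outliers and the three-SD interval can be sketched in R as below. The sample `x` is hypothetical (seed and rate are my assumptions). `boxplot.stats` uses the default rule of flagging points more than 1.5 IQR beyond the hinges, which is a much looser criterion than three SDs.

```r
set.seed(3)                      # arbitrary seed
x <- rexp(20, rate = 1/100)      # hypothetical right-skewed sample

bs <- boxplot.stats(x)           # default coef = 1.5
bs$out                           # values flagged as boxplot outliers

# Interval xbar +/- 3 s
lim <- mean(x) + c(-3, 3) * sd(x)
lim

# Boxplot outliers that also fall outside the three-SD interval
bs$out[bs$out < lim[1] | bs$out > lim[2]]

boxplot(x, horizontal = TRUE)    # draw the boxplot itself
```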

Normal Probability Plots. Both normal samples (top) have points roughly in a straight line. The exponential sample is pretty clearly not normal. The plot of the uniform sample (lower-right) seems to show lack of linearity toward the right.

[Figure: normal probability plots of the four samples]

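Shapiro-Wilk tests. P-values below $0.05$ lead to rejection of the null hypothesis that the data are from a normal population. Only the exponential sample is rejected. The uniform sample is not rejected (P-value about $0.30$), which shows how hard it is to detect non-normality with only 20 observations.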

shapiro.test(norm.1)

        Shapiro-Wilk normality test

data:  norm.1 
W = 0.9712, p-value = 0.779

shapiro.test(norm.2)

        Shapiro-Wilk normality test

data:  norm.2 
W = 0.9778, p-value = 0.903

shapiro.test(expo.3)

        Shapiro-Wilk normality test

data:  expo.3 
W = 0.8122, p-value = 0.001323

shapiro.test(unif.4)

        Shapiro-Wilk normality test

data:  unif.4 
W = 0.9454, p-value = 0.3028
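For a compact summary, the four P-values reported above can be compared with the 5% level in one step. The values are copied from the output; the names `p` and `alpha` are just my choices for this sketch.

```r
# P-values copied from the Shapiro-Wilk output above
p <- c(norm.1 = 0.779, norm.2 = 0.903, expo.3 = 0.001323, unif.4 = 0.3028)
alpha <- 0.05                    # significance level used in the text

# One row per sample: P-value and whether normality is rejected
data.frame(p.value = p, reject.normality = p < alpha)
```

With the samples held in a named list, say `samples`, the same P-values could be obtained directly with `sapply(samples, function(v) shapiro.test(v)$p.value)`.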