What is the best way to test if data is normally distributed?


I have two samples, $X_1 \; (N = 97)$ and $X_2 \; (N = 4782)$, drawn from the same population. I would like to test (using statistical visualizations such as normplot and qqplot, and hypothesis tests such as jbtest, chi2gof, and kstest in MATLAB) whether the data in each sample are normally distributed.

My first sample is:

    X = [8.13010235400000,13.6713071300000,14.0362434700000,18.4349488200000,26.5650511800000,30.9637565300000,34.3803447200000,40.6012946500000,45,49.3987053500000,58.6713071300000,59.0362434700000,59.0362434700000,59.0362434700000,61.9275130600000,61.9275130600000,63.4349488200000,63.4349488200000,63.4349488200000,63.4349488200000,63.4349488200000,64.4400348300000,71.5650511800000,71.5650511800000,71.5650511800000,71.5650511800000,75.9637565300000,75.9637565300000,75.9637565300000,75.9637565300000,75.9637565300000,75.9637565300000,75.9637565300000,75.9637565300000,75.9637565300000,75.9637565300000,77.4711922900000,77.4711922900000,77.4711922900000,77.4711922900000,77.4711922900000,77.4711922900000,77.4711922900000,77.4711922900000,77.4711922900000,78.6900675300000,90,90,90,90,90,90,90,90,90,90,90,90,90,90,90,93.1798301200000,97.1250163500000,97.7651660200000,102.528807700000,102.528807700000,102.528807700000,102.528807700000,102.528807700000,104.036243500000,104.036243500000,104.036243500000,104.036243500000,104.036243500000,104.036243500000,104.036243500000,105.255118700000,108.434948800000,108.434948800000,108.434948800000,108.434948800000,109.440034800000,116.565051200000,118.072486900000,120.963756500000,127.746805400000,130.601294600000,135,137.489552900000,139.398705400000,139.398705400000,149.036243500000,153.434948800000,159.227745300000,161.565051200000,179.999998800000,180];

The statistical visualizations in MATLAB suggest that the underlying distribution of both samples is normal. However, in the hypothesis tests, all run at the same significance level, the null hypothesis is not rejected for the $X_1$ sample (except by the chi-square test), while for $X_2$ it is rejected by every test.

I am now confused about how to show that my samples are normally distributed and that they come from the same population. What can I do in this situation?

PS: Sample $X_2$ is too large for me to post here, but if anyone can suggest a way to share it, I don’t mind doing so.


There are 2 best solutions below

You should use a Shapiro-Wilk test. Let

$$H_0 : \text{ the data is normally distributed}$$

$$H_a : \text{ the data is not normally distributed}$$

R statistical software gives

    x <- c(8.13010235400000,13.6713071300000,14.0362434700000,18.4349488200000,26.5650511800000,30.9637565300000,34.3803447200000,40.6012946500000,45,49.3987053500000,58.6713071300000,59.0362434700000,59.0362434700000,59.0362434700000,61.9275130600000,61.9275130600000,63.4349488200000,63.4349488200000,63.4349488200000,63.4349488200000,63.4349488200000,64.4400348300000,71.5650511800000,71.5650511800000,71.5650511800000,71.5650511800000,75.9637565300000,75.9637565300000,75.9637565300000,75.9637565300000,75.9637565300000,75.9637565300000,75.9637565300000,75.9637565300000,75.9637565300000,75.9637565300000,77.4711922900000,77.4711922900000,77.4711922900000,77.4711922900000,77.4711922900000,77.4711922900000,77.4711922900000,77.4711922900000,77.4711922900000,78.6900675300000,90,90,90,90,90,90,90,90,90,90,90,90,90,90,90,93.1798301200000,97.1250163500000,97.7651660200000,102.528807700000,102.528807700000,102.528807700000,102.528807700000,102.528807700000,104.036243500000,104.036243500000,104.036243500000,104.036243500000,104.036243500000,104.036243500000,104.036243500000,105.255118700000,108.434948800000,108.434948800000,108.434948800000,108.434948800000,109.440034800000,116.565051200000,118.072486900000,120.963756500000,127.746805400000,130.601294600000,135,137.489552900000,139.398705400000,139.398705400000,149.036243500000,153.434948800000,159.227745300000,161.565051200000,179.999998800000,180)

    shapiro.test(x)

        Shapiro-Wilk normality test

    data:  x
    W = 0.96484, p-value = 0.01061

Since $0.01061 < 0.05$, there is significant evidence at the $\alpha = 0.05$ level to reject the null hypothesis and conclude that the data are not normally distributed.

A Q-Q Plot supports our conclusion:

[Figure: normal Q-Q plot of the sample]
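A plot of this kind can be drawn in R with `qqnorm` and `qqline`. The sketch below is only illustrative: `x_demo` is a hypothetical simulated stand-in, and in practice it would be replaced by the data vector `x` defined above.

```r
# Hypothetical stand-in data; replace x_demo with the real sample.
set.seed(1)
x_demo <- rnorm(97, mean = 90, sd = 35)

# qqnorm() plots the sample quantiles against theoretical normal quantiles
# and invisibly returns the plotted coordinates.
qq <- qqnorm(x_demo, main = "Normal Q-Q plot")

# qqline() adds a reference line through the first and third quartiles.
qqline(x_demo, col = "red")
```

Points that bend away from the reference line at either end suggest tails that are heavier or lighter than those of a normal distribution.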

Graphical methods. First, some general comments. Graphical methods include:

  • making a histogram on a density scale and then overlaying a normal curve with $\mu = \bar X$ and $\sigma^2 = S^2,$ where $\bar X$ and $S^2$ are the sample mean and variance respectively,

  • plotting an empirical CDF of the data along with the CDF of the normal distribution (as above),

  • making a 'normal probability plot' (also called a 'normal Q-Q plot') to see if points lie mainly along a straight line.

Especially for large samples, such graphical descriptions of the data can give a good idea whether the data may have been randomly sampled from a normal population. But they do not provide formal statistical goodness-of-fit (GOF) tests.

Thank you for providing your first dataset. I used it to make the histogram and ECDF plots below. While I was working on this, @Remy (+1) provided a normal probability plot, so I'll skip that.

[Figure: histogram of the first sample with a normal density curve, and ECDF plot with a normal CDF]

My impression is that these data have tails that are 'too fat' to have come from a normal population.
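For reference, here is a minimal sketch of how the first two graphical checks listed above might be produced in R. It is not the exact code behind the plots in this answer, and `x_demo` is a hypothetical simulated stand-in for the real sample.

```r
# Hypothetical stand-in data; replace x_demo with the real sample.
set.seed(2)
x_demo <- rnorm(97, mean = 90, sd = 35)
m <- mean(x_demo)   # sample mean, used as mu
s <- sd(x_demo)     # sample standard deviation, used as sigma

op <- par(mfrow = c(1, 2))   # two panels side by side

# Histogram on a density scale with the fitted normal density overlaid.
hist(x_demo, freq = FALSE, main = "Histogram with normal density", xlab = "x")
curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "red", lwd = 2)

# Empirical CDF with the fitted normal CDF overlaid.
Fn <- ecdf(x_demo)
plot(Fn, main = "ECDF with normal CDF", xlab = "x")
curve(pnorm(x, mean = m, sd = s), add = TRUE, col = "red", lwd = 2)

par(op)   # restore the previous graphics settings
```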

Formal GOF tests. A very good GOF test is the Shapiro-Wilk test implemented in R statistical software. It tests the null hypothesis that the data fit some normal distribution against the alternative that they do not. Failure to reject the null hypothesis is not the same thing as a guarantee that the parent population is normal. Rejection is a good indication that the data depart in some important way from what one expects of a random sample from a normal population. Generally speaking, given a sufficiently large real-world sample, the Shapiro-Wilk test will reject, because precisely normal data are rare in practice and a large sample makes it easier to find the flaws (possibly unimportant ones).
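The effect of sample size can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is taken to be Student's t with 10 degrees of freedom (a mildly heavy-tailed choice made purely for illustration, not a model of the asker's data), and `reject_rate` and `r_t10` are hypothetical helper names. Note that R's `shapiro.test` accepts sample sizes only from 3 to 5000.

```r
set.seed(3)

# Proportion of simulated samples of size n for which shapiro.test()
# rejects normality at level alpha; r_gen(n) draws one sample.
reject_rate <- function(n, r_gen, n_rep = 200, alpha = 0.05) {
  mean(replicate(n_rep, shapiro.test(r_gen(n))$p.value < alpha))
}

# Assumed population: Student's t with 10 degrees of freedom.
r_t10 <- function(n) rt(n, df = 10)

rate_small <- reject_rate(97, r_t10)     # same size as the first sample
rate_large <- reject_rate(4782, r_t10)   # same size as the second sample
print(c(n97 = rate_small, n4782 = rate_large))
```

The same departure from normality is rejected far more often at the larger sample size, which is one way two samples from the same population can give conflicting test results.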

@Remy also shows a Shapiro-Wilk test, with results not consistent with normality. I get the same result, so I will not repeat it here.

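As implemented in most software, the Kolmogorov-Smirnov test uses the null hypothesis that the data fit one particular, fully specified normal distribution. Consequently, if $\mu$ and $\sigma^2$ must be estimated by $\bar X$ and $S^2,$ the probability theory behind the K-S test is not quite correct (larger errors for smaller samples): the reported P-values tend to be too large, so the test rejects less often than it should. The Lilliefors variant of the test corrects for this.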
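One way to account for the estimated parameters is to simulate the null distribution of the K-S statistic when $\mu$ and $\sigma$ are re-estimated from each simulated sample, which is the idea behind the Lilliefors test. The sketch below is a Monte Carlo approximation rather than a library implementation; `ks_stat_fitted` and `ks_fitted_mc` are hypothetical helper names, and `x_demo` is a simulated stand-in for the real data.

```r
set.seed(4)

# K-S distance between the ECDF of v and the normal CDF whose mean and SD
# are estimated from v itself.
ks_stat_fitted <- function(v) {
  n <- length(v)
  p <- pnorm(sort(v), mean = mean(v), sd = sd(v))
  max(seq_len(n) / n - p, p - (seq_len(n) - 1) / n)
}

# Monte Carlo P-value: the statistic does not change under shifts or
# rescaling of the data, so standard normal samples give its null distribution.
ks_fitted_mc <- function(v, n_sim = 2000) {
  d_obs <- ks_stat_fitted(v)
  d_sim <- replicate(n_sim, ks_stat_fitted(rnorm(length(v))))
  list(statistic = d_obs, p.value = mean(d_sim >= d_obs))
}

x_demo <- rnorm(97, mean = 90, sd = 35)   # hypothetical stand-in data

# Naive K-S test that treats the estimated parameters as if they were known.
naive <- ks.test(x_demo, "pnorm", mean(x_demo), sd(x_demo))
mc <- ks_fitted_mc(x_demo)
print(c(naive_p = naive$p.value, mc_p = mc$p.value))
```

Both approaches use the same test statistic; only the reference distribution used to compute the P-value differs.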

Notes: (1) You mention 'jbtest' and 'chi2gof'. I am not sure exactly what these tests are. I suppose 'kstest' must be some form of the K-S test, but I don't know whether it tests against any normal distribution or a specific normal distribution. (2) Thank you for providing data for your first sample. I have provided a histogram and an ECDF plot. (I have also confirmed @Remy's normal probability plot and Shapiro-Wilk result.)