Theoretical impossibility? Deviation from normality with a sample greater than 300?


Huge thanks in advance!

I've been led to believe that the following is a theoretical impossibility: a population of more than 300 records that does not approximate a normal distribution. The dataset I used is a set of financial transaction amounts (specifically, donations to a charity).

Using R, I ran the Shapiro-Wilk test on multiple random samples (each of size 5000) drawn from about 100000 gift amounts and received a p-value of 0. This is supposed to indicate that the sample deviates from normality.

Are the following claims in fact inconsistent (as they seem to be):

  1. My sample of gift amounts is larger than 30
  2. My sample of gift amounts approximates a normal distribution
  3. According to the Shapiro-Wilk test, my data set deviates from a normal distribution

There are 2 best solutions below

BEST ANSWER

I strongly suspect that the Shapiro-Wilk test is telling you the sample does not come from a normal distribution because the underlying distribution is not normal. If the underlying distribution is not normal, approximating it with a normal distribution isn't going to work: however many observations you collect, the data still follow that non-normal distribution.

EDIT: I just saw your comment about textbooks and using the normal approximation after a certain sample size. These textbooks are referring to the sampling distribution (the distribution of a statistic such as the sample mean across repeated samples) being approximately normal once the sample is large enough. This is not the same as the underlying distribution of the individual values being normal, and the two may have been confused.
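To make that distinction concrete, here is a minimal simulation sketch in Python (standard library only, rather than the R used in the question). The lognormal population is purely a hypothetical stand-in for right-skewed gift amounts, and the parameters, seed and sample counts are arbitrary assumptions. It compares the skewness of the raw values with the skewness of the means of many samples of size 30.

```python
import random
import statistics


def skewness(values):
    """Third standardized moment; 0 for a perfectly symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical stand-in for gift amounts: a right-skewed lognormal population.
population = [rng.lognormvariate(3.0, 1.0) for _ in range(100_000)]

# Means of many independent samples of size 30 drawn from that population.
sample_means = [statistics.fmean(rng.sample(population, 30)) for _ in range(2_000)]

raw_skew = skewness(population)
means_skew = skewness(sample_means)
print(f"skewness of raw amounts:  {raw_skew:.2f}")
print(f"skewness of sample means: {means_skew:.2f}")
```

The raw amounts stay strongly skewed no matter how many of them you collect; it is only the distribution of the sample means that moves toward symmetry, and for a skewed population it may still be visibly skewed at a sample size of 30.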

For some intuition building: approximating the sampling distribution with a normal distribution works so often because very many distributions found in nature are either normal themselves or satisfy the requirements of the central limit theorem (CLT). The most basic case is a normal underlying distribution: the sampling distribution of the mean is then normal for any sample size.

Now suppose the underlying distribution is not normal, say the heights of a certain species of tree, where some trees are double the height of others, fewer still are three times the height, and so on. The interesting thing is that if this distribution satisfies the requirements of the CLT, the sampling distribution of the mean is still approximately normal for a large enough sample size. I think this is what the textbooks are mostly getting at when they say that for samples of N > 30 you can start to use the normal distribution as an approximation for the sampling distribution.

This is all well and good if the underlying distribution is indeed normal or satisfies the requirements of the CLT. However, the financial world has many non-normal distributions in it, including heavy-tailed ones for which the variance, and sometimes even the mean, is poorly defined. For something like donations it is entirely plausible that someone makes a donation that is 10^n times bigger than someone else's (with n being a large number). You would not see this kind of difference in, say, the heights of trees in nature: there is no tree 1000000 times taller than the average, and yet in the financial world these things certainly can, and do, happen. Because such extreme events have a large impact on the mean and variance of an individual sample, a larger sample is needed before the normal approximation becomes reasonable.

If the mean of a sample is overwhelmingly determined by the maximum donation in that sample, then I think we are getting into the realm of extreme value theory, and I suspect that is what has happened here. In that case we would essentially be looking at the maximum value found in each sample, and that no longer follows a normal distribution.
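A rough way to see how much a single extreme value can dominate a sample is to compute the share of the sample total contributed by its largest observation. The sketch below is Python (standard library only) and uses two made-up stand-ins, not the actual donation data: a normal distribution for a light-tailed quantity and a Pareto distribution with shape 1.1 (infinite variance) for a heavy-tailed one.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable


def max_share(values):
    """Fraction of the sample total contributed by its single largest value."""
    return max(values) / sum(values)


n = 5_000
# Light-tailed stand-in: amounts clustered around 100 (normal, sd 15).
light = [rng.gauss(100, 15) for _ in range(n)]
# Heavy-tailed stand-in: Pareto with shape 1.1, whose variance is infinite.
heavy = [rng.paretovariate(1.1) for _ in range(n)]

print(f"largest value's share of total, light-tailed: {max_share(light):.4%}")
print(f"largest value's share of total, heavy-tailed: {max_share(heavy):.4%}")
```

In the light-tailed case no single observation matters to the mean; in the heavy-tailed case one observation can account for a noticeable part of the total, so the sample mean largely tracks the sample maximum.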

As for finding the underlying base distribution, the last time I looked at something similar to this I found that the underlying distribution was actually an extreme value distribution. I would suggest as a starting point looking at generalized extreme value distributions for this.

I would use some Q-Q plots to check what the underlying distribution is. That might be a bit qualitative for people's liking on here, but it lets you start to see whether a distribution is not normal.
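As a sketch of what a normal Q-Q check computes, the Python snippet below (standard library only) builds the Q-Q pairs by hand and summarizes how straight the resulting line is with a correlation coefficient. The helper names `normal_qq_pairs` and `qq_correlation` are made up for this illustration, and both data sets are simulated stand-ins. In R, `qqnorm` and `qqline` draw the actual plot.

```python
import random
from statistics import NormalDist, fmean, pstdev


def normal_qq_pairs(values):
    """Return (theoretical normal quantile, observed value) pairs for a Q-Q plot."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(fmean(ordered), pstdev(ordered))
    # Plotting positions (i + 0.5) / n keep the probabilities strictly inside (0, 1).
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


def qq_correlation(values):
    """Pearson correlation of the Q-Q pairs; near 1 when the points lie on a line."""
    pairs = normal_qq_pairs(values)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(1)  # fixed seed so the run is repeatable
normal_like = [rng.gauss(50, 10) for _ in range(5_000)]        # simulated normal data
skewed = [rng.lognormvariate(3.0, 1.0) for _ in range(5_000)]  # simulated skewed data

print(f"Q-Q correlation, normal data: {qq_correlation(normal_like):.4f}")
print(f"Q-Q correlation, skewed data: {qq_correlation(skewed):.4f}")
# The largest observations show where a skewed sample leaves the line.
for theoretical, observed in normal_qq_pairs(skewed)[-3:]:
    print(f"expected under normality: {theoretical:8.1f}   observed: {observed:8.1f}")
```

For normal data the pairs fall almost exactly on a straight line; for right-skewed data the largest observations sit far above what a normal distribution with the same mean and standard deviation would predict, which is the bend you would see in the upper tail of the plot.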

ANOTHER ANSWER

There are many distributions in the world that are not normal. Many financial quantities are like this. I would be shocked if the number of gifts were not a decreasing function of gift size (with some bumps at round numbers). Why should it be normal? Certainly the number of teeth people have is not normally distributed.