Can normal distribution stats be used on this data?


Background: I'm analyzing operating times for a "gadget". At some moments the operating times are very high (emergency situations), so the data has a lot of outliers:

[Figure: data]

I have eliminated outliers using modified Z-score like so (tools used: IPython/Scipy):

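import matplotlib.pyplot as plt
from scipy.stats import probplot  # public import path (scipy.stats.morestats is a deprecated namespace)

m = s2['time'].median()
s2['madbase'] = (s2['time'] - m).abs()  # absolute deviations from the median
mad = s2['madbase'].median()  # median absolute deviation (MAD)
s2['modzscore'] = 0.6745 * (s2['time'] - m) / mad  # modified Z-score
# Keep points whose |modified Z-score| is below the 3.5 cutoff from the NIST handbook
s3 = s2[s2['modzscore'].abs() < 3.5]
probplot(s3['time'], plot=plt)
plt.show()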

The probability plot for data w/o outliers looks like this then:

[Figure: probplot, no outliers]

So this does not look very much like a normal distribution, but r squared is close to 1, so the fit should be good?

Can I reliably interpret this data (with outliers eliminated) as data having normal distribution?

(Note: I'm a programmer and my memory of my university statistics course is rather hazy. I used NIST materials as the basis for outlier elimination; if I used them the wrong way, or should have used some other technique, please explain.)

Update: I used the probplot function from the scipy.stats.morestats module:

https://github.com/scipy/scipy/blob/master/scipy/stats/morestats.py#L296

The Y axis of the "no outliers" (2nd) graph shows the operating times (values). The X axis, as calculated by probplot, seems to be quantiles (so it's not even a typical QQ or PP plot).

I did this mostly because this page recommends it as a test:

http://www.itl.nist.gov/div898/handbook/eda/section3/probplot.htm


There are 2 best solutions below

5
On BEST ANSWER

You have produced a normal QQ-plot, which would plot as a straight line if your data were in fact normal. Your data definitely do NOT fall on a straight line, despite the large $R^2$ value (now you know a weakness of that measure of fit). The data are decidedly non-normal, so using a normal distribution would lead to incorrect or inaccurate inferences. Even without outliers, your data exhibit excessive right skew compared to a normal distribution. The modified data also appear to be lighter-tailed (less probability in the extremes) than a normal distribution. Here are some suggestions:

  1. The first task in data analysis is to determine what question you are trying to answer. Why are you eliminating "outliers"? They seem fairly common in your data and appear to be an important feature of your system. You should keep them unless you are only trying to model or analyze "normal" or "non-emergency" situations.
  2. Depending on the answer to 1, you will need to decide how sophisticated you want to get. If you insist on using normal-based theory, you need to transform your data to make it look normal. There are a number of techniques to do this. Since you are familiar with the NIST handbook and are a programmer, I assume you are comfortable with several programming languages, so you may want to try implementing a Box-Cox transform on your modified data.
  3. Another option is to avoid transforms and fit a more appropriate model to your raw data (modified or not) using one of the common parametric fitting methods: maximum likelihood or the method of moments. Several parametric families of distributions can usefully model right-skewed, non-negative data: gamma, lognormal, and various others. A helpful gallery of distributions is also on NIST, so you can see the general shapes (and read about their properties) and judge which best fits your data.
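As a rough illustration of the Box-Cox suggestion in item 2, here is a minimal sketch using `scipy.stats.boxcox`. The `times` array is synthetic (an assumption standing in for your operating times, which I don't have), and Box-Cox requires strictly positive values.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the operating times (assumption: positive, right-skewed).
rng = np.random.default_rng(0)
times = rng.lognormal(mean=1.0, sigma=0.6, size=2000)

# With no lambda given, boxcox picks the lambda that maximizes the
# log-likelihood and returns the transformed data together with it.
transformed, lam = stats.boxcox(times)

# Sample skewness before and after; values near 0 indicate symmetry.
skew_before = stats.skew(times)
skew_after = stats.skew(transformed)
print(f"lambda = {lam:.3f}")
print(f"skewness before = {skew_before:.3f}, after = {skew_after:.3f}")
```

A drop in skewness does not by itself make the transformed data normal, so it is still worth checking a normal QQ-plot of `transformed`.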

You can evaluate most distributional fits using QQ-plots against different candidate distributions (check your software documentation to see which QQ-plots or probability plots are available) and see which gives the most linear relationship.
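One way this comparison might look with SciPy is sketched below. The data are synthetic (an assumption), the shape parameters are fitted by maximum likelihood with the location fixed at 0 (also my assumption, reasonable for non-negative times), and `probplot` is then asked for a QQ-plot against each candidate through its `dist` and `sparams` arguments.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the operating times (assumption: gamma-like data).
rng = np.random.default_rng(1)
times = rng.gamma(shape=2.0, scale=3.0, size=2000)

# Maximum-likelihood estimates of the shape parameters, location fixed at 0.
a_hat, _, _ = stats.gamma.fit(times, floc=0)
s_hat, _, _ = stats.lognorm.fit(times, floc=0)

# Candidate distributions and the shape parameters probplot needs for each.
candidates = {
    "normal": (stats.norm, ()),
    "gamma": (stats.gamma, (a_hat,)),
    "lognormal": (stats.lognorm, (s_hat,)),
}

r_values = {}
for name, (dist, shape_params) in candidates.items():
    # probplot returns the plotted points and a least-squares line (slope, intercept, r).
    (_, _), (_, _, r) = stats.probplot(times, sparams=shape_params, dist=dist)
    r_values[name] = r
    print(f"{name:10s} r = {r:.4f}")
```

As noted above, the correlation is a weak summary on its own, so pass `plot=plt` (with `matplotlib.pyplot` imported as `plt`) and look at where each plot bends away from the line, especially in the tails.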

Once you have a satisfactory model (i.e., one that captures the KEY features you care about), you can use it to make inferences, run Monte Carlo simulations, or do whatever other analysis you want. In your case, I'd guess the key feature is the propensity for outliers, so you would want a distribution with relatively high kurtosis (>3), i.e. positive excess kurtosis (>0), provided your data really has such high kurtosis.
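A minimal sketch of that last step, under stated assumptions: the data are synthetic, the lognormal family is chosen only for illustration, and `threshold` is a hypothetical "emergency" cutoff that I made up. Note that `scipy.stats.kurtosis` reports excess kurtosis by default, so a normal distribution gives about 0.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the raw operating times (assumption).
rng = np.random.default_rng(2)
times = rng.lognormal(mean=1.0, sigma=0.8, size=5000)

# Excess kurtosis (fisher=True is the default): > 0 means heavier tails than normal.
excess_kurt = stats.kurtosis(times)

# Fit a lognormal by maximum likelihood with the location fixed at 0,
# then freeze the fitted distribution for later use.
s_hat, loc_hat, scale_hat = stats.lognorm.fit(times, floc=0)
fitted = stats.lognorm(s_hat, loc=loc_hat, scale=scale_hat)

# Monte Carlo estimate of the chance that one operation exceeds the cutoff.
threshold = 10.0  # hypothetical cutoff, not taken from the question
simulated = fitted.rvs(size=100_000, random_state=rng)
p_mc = np.mean(simulated > threshold)

# For a single tail probability the survival function gives the same answer
# directly; simulation pays off for quantities without a closed form.
p_exact = fitted.sf(threshold)
print(f"excess kurtosis = {excess_kurt:.2f}")
print(f"P(time > {threshold}) ~ {p_mc:.4f} (simulated), {p_exact:.4f} (from the fit)")
```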

1
On

It's hard to say for sure what's going on here, but it looks like you just fit a line to the cumulative distribution function, with axis units in terms of... well, I'm not sure. You would get more mileage out of the probability density function in this case. I say this because of the roughly sigmoidal shape. A normal distribution would give you a CDF that looks like this, but so would a wide variety of other probability distributions.

Fitting a line to this data (without regard to r^2) does not help you test or reject the hypothesis that the data is normal. Instead, you should try some normality testing.
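For what it's worth, here is a small sketch of such testing with SciPy, run on synthetic right-skewed data (an assumption, since I don't have your sample). It puts the squared correlation from `probplot` next to two formal tests, Shapiro-Wilk (`scipy.stats.shapiro`) and D'Agostino-Pearson (`scipy.stats.normaltest`).

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample standing in for the filtered times (assumption).
rng = np.random.default_rng(3)
times = rng.gamma(shape=2.0, scale=3.0, size=1000)

# Correlation of the least-squares line through the normal probability plot.
(_, _), (_, _, r) = stats.probplot(times, dist="norm")

# Formal tests; the null hypothesis in both is that the sample is normal.
shapiro_stat, shapiro_p = stats.shapiro(times)
dagostino_stat, dagostino_p = stats.normaltest(times)

print(f"probplot r^2         = {r**2:.3f}")
print(f"Shapiro-Wilk p       = {shapiro_p:.3g}")
print(f"D'Agostino-Pearson p = {dagostino_p:.3g}")
```

On data like this the r^2 can stay fairly high while both tests reject normality. Keep in mind that with large samples these tests also flag very small departures, so read the p-values alongside the plot rather than instead of it.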

As for those 'outliers', how much of your data do they represent?