I have a relatively large database of 40,000+ entries, each with several data points. The data is collected on an ongoing basis. I would like to establish a baseline mean and standard deviation for several data sets within the database so that I can identify outliers as part of a QA/QC process.
However, as far as I can tell, most recommendations for identifying outliers assume you can already establish a reasonable mean and standard deviation to begin with. This seems circular when dealing with an empirical data set: won't the mean and standard deviation you produce be tainted by the outliers themselves? And if so, won't some outliers escape detection by a method such as flagging values n standard deviations from the mean?
There are values I have identified as clear outliers and corrected or thrown out, but I'm concerned that I'm still missing large numbers of outliers that sit on the edge of what seems reasonable.
I don't have a background in statistics or data management/analysis, but it's fallen on me to handle this database, so I would greatly appreciate any insights or responses on this matter. It may simply be that I'm missing some essential conceptual piece here.
A common rule of thumb (Tukey's fences) flags as an outlier any point more than $1.5$ times the interquartile range below the first quartile or above the third quartile. Use all the data points to determine the quartiles, eliminate the points those fences flag, and then compute the mean and standard deviation from the remaining data. Because quartiles are robust to extreme values, this avoids the circularity you describe. "Large quantities of outliers" is a bit of a contradiction in terms, though, and may simply mean that the standard deviation of your data is large.
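As a minimal sketch of that procedure, assuming your data points can be pulled out of the database into a flat list of numbers (function names here are illustrative; quartile conventions also vary slightly between implementations, so results near the fences may differ by a small amount):

```python
import statistics

def iqr_fences(values):
    """Return Tukey's fences: (Q1 - 1.5*IQR, Q3 + 1.5*IQR).

    The quartiles are computed from ALL the data, including any
    outliers -- quartiles are robust, so extreme values barely move them.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def baseline_stats(values):
    """Mean and standard deviation after dropping points outside the fences."""
    lo, hi = iqr_fences(values)
    kept = [v for v in values if lo <= v <= hi]
    return statistics.mean(kept), statistics.stdev(kept)

# Example: one wild value barely shifts the quartiles, so it gets
# flagged and the baseline is computed from the remaining points.
data = [1, 2, 3, 4, 5, 100]
mean, sd = baseline_stats(data)  # 100 is excluded; mean == 3.0
```

You can then use the resulting mean and standard deviation as the baseline for your QA/QC checks, and recompute the fences periodically as new data arrives.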