Can I use mean and standard deviation to spot outliers?


I have a list of measured numbers (e.g., lengths of products). From these I can easily compute the mean and the standard deviation.

Now, when a new measurement arrives, I'd like to state the probability that this number belongs to the list, or that it is an outlier that does not belong to it. Is such a statement possible given only the mean and standard deviation?

Can I compute the probability with which this new value is part of the list? I'd like to have a probability as a result.

There are 3 best solutions below

12
On BEST ANSWER

Absolutely. It is a known fact that for a sufficiently long list of approximately normally distributed values (denoting the mean by $\mu$ and the standard deviation by $\sigma$), the range $[\mu-3\sigma,\mu+3\sigma]$ encompasses about $99.73\%$ of the data points. So if the new value falls outside this range, it is very unusual for a member of the list: only about $0.27\%$ of genuine values would land that far out.
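As a rough illustration of the three-sigma check, here is a minimal Python sketch. The function name `outside_three_sigma`, the parameter `k`, and the sample `lengths` are made up for this example, and the check is only meaningful if the measurements are roughly normally distributed.

```python
import statistics

def outside_three_sigma(values, new_value, k=3.0):
    """Return True if new_value lies outside [mu - k*sigma, mu + k*sigma]."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation
    return abs(new_value - mu) > k * sigma

# Hypothetical product lengths, for illustration only.
lengths = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]

print(outside_three_sigma(lengths, 10.3))
print(outside_three_sigma(lengths, 10.5))
```

This only gives a yes/no flag. It says nothing yet about how extreme the new value is.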

You can also use the concept of a $p$-value here, assuming the values follow a Gaussian distribution (since we don't know the true one). Find $\Phi(x)$, the CDF of $N(\mu,\sigma^2)$ evaluated at $x = \text{new value}$. For a value above the mean, its $p$-value is $1-\Phi(x)$; for a value below the mean, use $\Phi(x)$ instead. If the $p$-value is less than some significance level (say $0.05$), you can consider the value an outlier; otherwise treat it as consistent with the list.
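The $p$-value idea can be sketched in Python with the standard library's `statistics.NormalDist`. The function name `two_sided_p_value` and the sample data are assumptions for this example. The sketch uses a two-sided tail so that values far above and far below the mean are treated alike, which is a choice made here rather than part of the answer. The result is the probability of seeing a value at least this extreme if it did come from the fitted normal distribution. It is not the probability that the value belongs to the list.

```python
from statistics import NormalDist, mean, stdev

def two_sided_p_value(values, new_value):
    """Two-sided tail probability of new_value under N(mean, stdev^2) fitted to values."""
    dist = NormalDist(mean(values), stdev(values))
    upper_tail = 1.0 - dist.cdf(new_value)
    # Take the smaller tail and double it, so both directions count as extreme.
    return 2.0 * min(upper_tail, 1.0 - upper_tail)

# Hypothetical product lengths, for illustration only.
lengths = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]

alpha = 0.05  # significance level, chosen by the analyst
for x in (10.1, 10.5):
    p = two_sided_p_value(lengths, x)
    print(x, p, "outlier" if p < alpha else "consistent with the list")
```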

1
On

Yes. You can use your standard deviation to tell you this. Think about what the standard deviation is telling you.

1
On

It is best to use a boxplot to find outliers. The problem with using the sample mean $\bar X$ and the sample SD $S$ is that an outlier seriously affects the values of $\bar X$ and $S$.

By contrast, the boxplot uses the median and the interquartile range to detect outliers. These measures of location and dispersion, respectively, are not much affected by outliers.
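A boxplot-style check might look like the following Python sketch. It assumes the common boxplot convention of flagging points more than $1.5$ times the interquartile range beyond the quartiles. That multiplier, the function names, and the sample data are assumptions for illustration, and different quartile definitions give slightly different fences.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (low, high) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_boxplot_outlier(values, new_value, k=1.5):
    """Return True if new_value falls outside the fences computed from values."""
    low, high = tukey_fences(values, k)
    return new_value < low or new_value > high

# Hypothetical product lengths, for illustration only.
lengths = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]

print(tukey_fences(lengths))
print(is_boxplot_outlier(lengths, 10.6))
```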

If you feel you must use $\bar X$ and $S$, then here is how to test observations one at a time for outliers: Omit the suspected outlier. Find $\bar X^*$ and $S^*$ from the remaining $n - 1$ observations. Then see if the omitted point is in some interval such as $(\bar X^* - 2.5S^*, \bar X^* + 2.5S^*)$. If so, the suspected observation is not judged an outlier. If it is outside the interval, then consider it an outlier. The disadvantage of this method is that you have to recompute $\bar X^*$ and $S^*$ afresh for each suspected outlier.
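The leave-one-out procedure can be sketched in Python as follows. The function name `leave_one_out_outliers` and the sample data are made up for this example, and the multiplier $2.5$ is just the example value from the interval above. The list needs at least three values so that a standard deviation can be computed after one is omitted.

```python
import statistics

def leave_one_out_outliers(values, k=2.5):
    """Flag each value lying outside mean* +/- k*sd* of the remaining values."""
    flagged = []
    for i, x in enumerate(values):
        rest = values[:i] + values[i + 1:]   # omit the suspected outlier
        mean_rest = statistics.mean(rest)    # X-bar* from the other n - 1 points
        sd_rest = statistics.stdev(rest)     # S* from the other n - 1 points
        if abs(x - mean_rest) > k * sd_rest:
            flagged.append(x)
    return flagged

# Hypothetical product lengths with one suspicious value, for illustration only.
lengths = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 12.0]

print(leave_one_out_outliers(lengths))
```

For a new measurement that is not yet in the list, the same test reduces to computing the mean and standard deviation of the existing list once and checking the new value against that interval.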