A measure similar to variance that's always between 0 and 1?


Consider the following histogram, obtained from around 1000 measures of distance.

[Figure: histogram of the distance measurements, with the interval around the mean marked in red]

As you can see, most of the data lie near the mean, around values 5 to 10. There are also some isolated samples far away, at values such as 100 and 160.

1) Is there any statistical measure I can use to detect when this happens? Sometimes there are no outliers and I'm trying to detect such cases. I was thinking of thresholding variance, but I'm looking for a measure with a value in a fixed interval (e.g. always 0 to 1).

2) I'm trying to get an interval like the one in red that only includes the measurements around the mean. I'm looking for a method that works for different histograms with a similar shape (the number of readings and the values can vary, but the shape is always similar). Could you suggest a method?


There are 3 best solutions below

3
On BEST ANSWER

In your case, I think variance is not the right approach (see the Note at the end). Perhaps you could consider using boxplots for 'outlier detection'.

Here is a brief example using exponential data, which tend to have outliers. (The exponential distribution is often used to model waiting times for events or lifetimes of electronic components.) Consider the data below, generated using R statistical software. Twenty observations are rounded to one place and sorted:

 x = sort(round(rexp(20, .01), 1));  x
 [1]   0.2   0.7   2.6  14.7  28.3  31.1  39.3  45.0  48.7  56.5
[11]  63.0  77.0  77.7  80.2  81.9  96.8 103.6 110.9 157.2 245.1

Sample statistics are shown below. Roughly speaking, the lower quartile 30.40, the median 59.75, and the upper quartile 85.62 divide the sorted data into four 'chunks' of five observations each. The interquartile range IQR $= Q_3 - Q_1 = 55.225$ is the width of the box in a boxplot and an important measure of variability for detecting outliers.

summary(x);  sd(x);  IQR(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.20   30.40   59.75   68.03   85.62  245.10 
[1] 58.4176  # standard deviation
[1] 55.225   # inter-quartile range 

The ends of the box in a boxplot are at the quartiles, and the median is marked by a heavy bar inside the box.

boxplot(x, horizontal=TRUE, col="skyblue2")

[Figure: horizontal boxplot of x, with the observation 245.1 plotted separately as an outlier]

The largest observation 245.1 is noted as an outlier, and plotted separately in the boxplot. It is noted as an outlier because it is greater than $Q_3 + 1.5(\text{IQR}) = 168.46.$ (This is known as the '1.5 IQR criterion'. This criterion is popular, but there are others.)
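The 1.5 IQR criterion can also be applied by hand, which gives an interval of the kind asked about in part 2 of the question. The following is only a sketch: it re-enters the twenty sorted observations, uses R's default quartiles from `quantile()`, and the names `fence_lo`, `fence_hi`, `inside`, and `frac_flagged` are illustrative choices. The last line is one possible bounded summary (an assumption, not part of the criterion itself).

```r
# The twenty sorted observations from the example above
x <- c(0.2, 0.7, 2.6, 14.7, 28.3, 31.1, 39.3, 45.0, 48.7, 56.5,
       63.0, 77.0, 77.7, 80.2, 81.9, 96.8, 103.6, 110.9, 157.2, 245.1)

# Quartiles (R's default quantile type) and interquartile range
q   <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Fences of the 1.5 IQR criterion
fence_lo <- q[1] - 1.5 * iqr
fence_hi <- q[2] + 1.5 * iqr

# Observations outside the fences are flagged; the rest are kept
flagged <- x < fence_lo | x > fence_hi
outliers <- x[flagged]
inside   <- x[!flagged]

c(fence_lo = fence_lo, fence_hi = fence_hi)
outliers
range(inside)   # interval spanned by the unflagged observations

# Assumption: the fraction of flagged points, always between 0 and 1
frac_flagged <- mean(flagged)
frac_flagged
```

For small samples, `boxplot()` places the box at Tukey's hinges, which can differ slightly from the `quantile()` quartiles used here; for these data both versions flag only 245.1.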

Please note that there is nothing "wrong" with observation 245.1. As I said earlier, it is the nature of exponential data to have outliers. (It would probably be best to keep the outlier when doing data analysis.)

For data such as yours, I suppose the straggling observations far above your red bracket would be marked as outliers. (Then you would have to consider for your data what circumstances might have produced these outliers, and how the outliers should be handled in data analysis.)

Most statistics books and many online sites have additional information about boxplots, outliers, and how to regard outliers in data analysis.

Note: Variances (and standard deviations) do not work well for outlier detection. If $X_i$ is an outlier, then the term $(X_i - \bar X)^2$ in the variance can be unusually large. So measuring the distance of an observation from $\bar X$ in terms of standard deviations can be misleading because the outlier itself has a large effect on the variance (and hence, the standard deviation). By contrast, outliers do not have much effect on the size of the interquartile range (IQR). Thus IQR is more effective in outlier detection.

In the example, changing the last observation from 245.1 to 100.0 reduces the standard deviation of the sample from 58.42 to 41.96, but does not change the IQR at all.
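The comparison above can be reproduced with a few lines of R. This is a minimal sketch that re-enters the sorted data; the name `y` for the modified sample is an illustrative choice.

```r
# The twenty sorted observations from the example above
x <- c(0.2, 0.7, 2.6, 14.7, 28.3, 31.1, 39.3, 45.0, 48.7, 56.5,
       63.0, 77.0, 77.7, 80.2, 81.9, 96.8, 103.6, 110.9, 157.2, 245.1)

# Replace the largest observation (245.1) by 100.0
y <- replace(x, which.max(x), 100.0)

# The standard deviation changes a lot ...
c(sd_before = sd(x), sd_after = sd(y))

# ... while the interquartile range stays the same
c(iqr_before = IQR(x), iqr_after = IQR(y))
```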

0
On

To answer the title question: if $|X - X_0| \leq 1$ for some fixed point $X_0$, then the variance of $X$ is bounded by $1$, because $\mathrm{Var}(X) \leq E[(X - X_0)^2] \leq 1$. So you could use any real-valued function that collapses the range of $X$ down to an interval of radius 1.

For example, you could measure

$$ \mathrm{Var}\left( \frac{2}{\pi} \arctan(X - X_0) \right) $$
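As a rough illustration of the transformed variance above, here is a sketch in R. The answer does not say how to choose $X_0$, so taking the sample median as the default is an assumption, as is the function name `bounded_var`. The population form of the variance (dividing by $n$) is used because the usual sample variance, which divides by $n-1$, can exceed 1 for very small samples.

```r
# Variance of (2/pi) * atan(x - x0), which always lies in [0, 1].
# x0 defaults to the sample median (an assumption, not from the answer).
bounded_var <- function(x, x0 = median(x)) {
  y <- (2 / pi) * atan(x - x0)   # every y lies between -1 and 1
  mean((y - mean(y))^2)          # population variance (divide by n)
}

# Example with the twenty observations used in the boxplot answer
x <- c(0.2, 0.7, 2.6, 14.7, 28.3, 31.1, 39.3, 45.0, 48.7, 56.5,
       63.0, 77.0, 77.7, 80.2, 81.9, 96.8, 103.6, 110.9, 157.2, 245.1)
bounded_var(x)
```

Because `atan` is not scale-invariant, the result depends on the units of the data, so values are only comparable between samples measured on the same scale.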

(this answer does not attempt to address any of the contents of the post)

0
On

One example of such a function is an exponential of the following form:

$$f(v) = \exp[-v^k/s^k]$$

You put in the variance $v$, which lies in $[0,+\infty)$, and you get out something in $(0,1]$, which approaches $0$ as $v \to \infty$.

  1. If the variance is $0$, you get $1$ out.
  2. The larger the variance, the closer the output gets to $0$.
  3. $s$ and $k$ are both positive parameters that steer how fast the output shrinks to $0$.

If you want the opposite you can just take $1-f(v)$ instead.
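The mapping $f(v)$ above is short to write in R. This is a sketch only: the function name `squash` and the default values of `s` and `k` are hypothetical, and suitable values would have to be tuned to the scale of the data.

```r
# f(v) = exp(-(v/s)^k), mapping a variance v >= 0 into (0, 1].
# s sets the scale and k the steepness; both must be positive.
squash <- function(v, s = 1, k = 1) {
  stopifnot(v >= 0, s > 0, k > 0)
  exp(-(v / s)^k)
}

v <- c(0, 1, 10, 100, 1000)
squash(v, s = 100)       # equals 1 at v = 0 and decreases toward 0
1 - squash(v, s = 100)   # the opposite direction: 0 at v = 0, rising toward 1
```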