I require a specific measure of variability to measure the scatter of a set, about the mean point.


I'm looking for a unique measure of variability. My problem definition is as follows.

Given a set of data, I require a measure of how scattered the data are about the mean position. Standard statistical measures such as variance and standard deviation cannot be used, as the value of such parameters increases as the number of outliers in the data set increases.

As an example, take data sets of length 6, ranging from 0 to 500 and having a mean of 250, and consider 4 situations:

i) set1 = {0, 100, 200, 300, 400, 500}

ii) set2 = {0, 100, 100, 400, 400, 500}

iii) set3 = {0, 250, 250, 250, 250, 500}

iv) set4 = {0, 0, 0, 500, 500, 500}

The most desirable set in this situation would be set1, as it is the most scattered. Variance and SD measurements taken on these sets would give the highest value for set4.
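As a quick check of that claim, here is a minimal Python sketch that computes the variance and SD of the four sets using only the standard library. The names `sets`, `sd_by_set` and `largest` are illustrative, and the population (divide by $n$) versions are an assumption; the sample versions change the values but not the ordering.

```python
import statistics

# The four example sets from the question.
sets = {
    "set1": [0, 100, 200, 300, 400, 500],
    "set2": [0, 100, 100, 400, 400, 500],
    "set3": [0, 250, 250, 250, 250, 500],
    "set4": [0, 0, 0, 500, 500, 500],
}

# Population SD (divide by n) for each set; this choice is an assumption.
sd_by_set = {name: statistics.pstdev(xs) for name, xs in sets.items()}

for name, xs in sets.items():
    print(f"{name}: mean={statistics.fmean(xs):.0f}  "
          f"variance={statistics.pvariance(xs):.1f}  SD={sd_by_set[name]:.1f}")

# Name of the set with the largest SD.
largest = max(sd_by_set, key=sd_by_set.get)
print("largest SD:", largest)
```

All four sets have mean 250, and the SD is largest for set4 even though set1 is the evenly spread one.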

Is there an existing measure for the same? If not, could I have suggestions on methods to measure it?

Best answer

You are correct that the variance (or SD) is inflated by the presence of 'outliers'. The question is whether the outliers are a natural part of the population or process being measured, or whether outliers represent 'errors' that need to be removed. (In the former case, outliers are part of the 'scatter' you are trying to measure; in the latter, you could remove the outliers, then find variance or SD.)

There are alternative measures of variability that are less sensitive to observations far from the mean. One is the mean absolute deviation (MAD), sometimes defined as $\frac{1}{n}\sum_{i=1}^n |X_i - \bar X|.$ One can also use the median of the absolute deviations $|X_i - \bar X|.$ [The decreased sensitivity comes from the fact that the absolute value $|X_E - \bar X|$ tends not to be as large as the square $(X_E - \bar X)^2,$ where $X_E$ denotes an extreme observation.]
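A minimal Python sketch of these two measures, applied to the four sets from the question, might look as follows. The function names `mean_abs_dev` and `median_abs_dev` are hypothetical, and both take deviations about the sample mean $\bar X$, as in the definitions above.

```python
import statistics

def mean_abs_dev(xs):
    """Mean absolute deviation about the sample mean: (1/n) * sum |x_i - xbar|."""
    xbar = statistics.fmean(xs)
    return statistics.fmean([abs(x - xbar) for x in xs])

def median_abs_dev(xs):
    """Median of the absolute deviations |x_i - xbar| about the sample mean."""
    xbar = statistics.fmean(xs)
    return statistics.median([abs(x - xbar) for x in xs])

# The four example sets from the question.
sets = {
    "set1": [0, 100, 200, 300, 400, 500],
    "set2": [0, 100, 100, 400, 400, 500],
    "set3": [0, 250, 250, 250, 250, 500],
    "set4": [0, 0, 0, 500, 500, 500],
}

for name, xs in sets.items():
    print(f"{name}: mean abs dev={mean_abs_dev(xs):.1f}  "
          f"median abs dev={median_abs_dev(xs):.1f}  "
          f"SD={statistics.pstdev(xs):.1f}")
```

On these four sets both measures still take their largest value for set4, so they reduce the influence of observations far from the mean without reversing the ordering that the SD gives.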

Another method is to trim the data by removing a certain percentage (often 5%) of the observations from each of the low and high ends of the sorted data. Then find the trimmed mean of the remaining 90% of the observations, and finally find the SD or MAD of those remaining observations.
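The trimming procedure can be sketched in Python as below. This is one possible reading, not a standard routine: the function name `trimmed_stats`, the `percent` argument, rounding the trim count down, and the toy data are all assumptions made for illustration.

```python
import statistics

def trimmed_stats(data, percent=5):
    """Drop `percent`% of the observations from each end of the sorted data,
    then return (trimmed mean, SD, mean absolute deviation) of what is left."""
    xs = sorted(data)
    k = len(xs) * percent // 100          # number dropped per end, rounded down
    kept = xs[k:len(xs) - k]              # k == 0 keeps everything
    m = statistics.fmean(kept)
    sd = statistics.stdev(kept)           # sample SD (divides by n - 1)
    mad = statistics.fmean([abs(x - m) for x in kept])
    return m, sd, mad

# Illustrative data only: 1..19 plus one extreme value (20 observations).
values = list(range(1, 20)) + [1000]

print("untrimmed mean, SD:", statistics.fmean(values), statistics.stdev(values))
print("trimmed mean, SD, MAD:", trimmed_stats(values, percent=5))
```

With 20 observations and 5% trimming, one value is dropped from each end, so the extreme value 1000 no longer enters the mean or the SD.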

Example: I took many samples of size $n = 100$ from an exponential distribution with population mean $\mu = 1$ and SD $\sigma = 1$. The ordinary sample means and ordinary sample SDs each averaged very nearly 1, so these sample statistics are accurately estimating the population mean and SD.

Then I repeated the experiment, deleting all observations above 3 (about 5%, or 5 out of 100) as 'outliers'. (There were no extreme values at the low end.) The means and SDs of the remaining observations then averaged about 0.84 and 0.71, respectively. So by trimming the 'outliers' (which are really a natural feature of the exponential distribution) I got seriously incorrect estimates of the population mean and SD.
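An experiment of this kind can be reproduced roughly with the Python sketch below. The number of replications (2000), the seed, and the function name `simulate` are assumptions; the software and settings used for the figures quoted above are not stated, so exact values will differ slightly from run to run and from those figures.

```python
import random
import statistics

def simulate(reps=2000, n=100, cutoff=3.0, seed=1):
    """Average the sample mean and SD over `reps` exponential samples of size n,
    with and without deleting observations above `cutoff`."""
    rng = random.Random(seed)
    full_means, full_sds, cut_means, cut_sds = [], [], [], []
    for _ in range(reps):
        # Exponential distribution with rate 1, i.e. mean 1 and SD 1.
        sample = [rng.expovariate(1.0) for _ in range(n)]
        kept = [x for x in sample if x <= cutoff]   # delete the 'outliers'
        full_means.append(statistics.fmean(sample))
        full_sds.append(statistics.stdev(sample))
        cut_means.append(statistics.fmean(kept))
        cut_sds.append(statistics.stdev(kept))
    return {
        "full_mean": statistics.fmean(full_means),
        "full_sd": statistics.fmean(full_sds),
        "cut_mean": statistics.fmean(cut_means),
        "cut_sd": statistics.fmean(cut_sds),
    }

result = simulate()
for key, value in result.items():
    print(f"{key}: {value:.3f}")
```

The values after deletion can also be checked against the exponential distribution truncated at 3, whose mean is $1 - 3e^{-3}/(1 - e^{-3}) \approx 0.84$ and whose SD is about 0.71.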

Also, the MAD's (as defined above) of the original samples, without deletions, averaged 0.87, so this measure of variability is not estimating the population SD.

While trimming 'outliers' and using MAD's can be useful in some cases, it is necessary to understand the consequences of these methods before using them.

The general topic here is "robust estimation," which you might want to read about online or in journal articles.

Notes: (a) Your four examples stress symmetrical data with both high and low extreme values. But they are not representative of all possible patterns of outliers. Also, it is difficult to discuss outliers in such small samples. More problematic (and perhaps more common in practice) are severely skewed data in which extreme high values (or low values, but not both) are a natural part of the population distribution. For example, exponential data are right-skewed.

(b) You should not use the word 'parameter' to describe a statistic (such as a sample variance) computed from data; the word 'parameter' is ordinarily reserved for numerical characteristics of the population.