Remove extreme values and calculate a reasonable "average" of a data set

About the nature of the data set: I am measuring the run times of two programs (A and B) that do the same thing with different algorithms, and I would like to know which one runs faster. However, the measured values are close: sometimes A is faster, sometimes B. From observing the values, B seems to be faster more often (and in theory it should be). Also, since the programs take some time to run, the sets are small (say, 20 samples).

I am looking for some method to remove extreme cases and be left with a set in which all values are close to the mean. By "some" I mean either the most commonly used method or your personal preference. I would like to write a program that does this, so an algorithm for it would be nice.

Best answer

If there is a good chance that the 'extreme cases' are a natural recurring feature of the software, likely to be encountered occasionally in the future, then they tell an important part of the story about the software and you should think carefully before removing them.
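Before deciding, it may help to look at which values would count as extreme and how often they occur. The sketch below uses made-up run times (the vector `runs` is hypothetical, not data from the question) and the boxplot rule from base R's `boxplot.stats`, which is only one of several possible conventions for flagging extreme values.

```r
# Made-up run times in seconds for one program (hypothetical, 20 samples)
runs <- c(1.02, 0.98, 1.01, 0.99, 1.03, 1.00, 0.97, 1.04, 1.01, 0.99,
          1.02, 1.00, 0.98, 1.01, 1.75, 1.00, 0.99, 1.03, 2.40, 1.01)

# Values lying more than 1.5 times the hinge spread beyond the hinges
out <- boxplot.stats(runs)$out
out

# Share of runs flagged as extreme; a large or stable share across
# repeated experiments suggests these runs are a feature, not noise
share <- length(out) / length(runs)
share
```

If the same kind of slow run keeps showing up in every batch of measurements, that is an argument for reporting it rather than trimming it away.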

One method is to use a 'trimmed mean'. A 5% trimmed mean is computed by sorting the data, removing the bottom 5% and the top 5%, and then finding the mean of the remaining middle 90% of the data. (One hopes that atypical high or low values are among the observations trimmed. Because both high and low values are trimmed automatically, one can claim the process is 'fair' in some sense. But for right-skewed data, such as the exponential data in the example that follows, the trimmed mean is generally smaller than the mean of all the data.)
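Since the question asks for an algorithm, here is a minimal sketch of the sort, trim, and average steps just described, written out by hand. The function name `trimmed_mean` and the run times in `runs` are hypothetical, chosen only for illustration; the last line checks the result against R's built-in `mean(x, trim = ...)`.

```r
# Hand-written trimmed mean (hypothetical helper, for illustration only)
trimmed_mean <- function(x, trim = 0.05) {
  stopifnot(is.numeric(x), trim >= 0, trim < 0.5)
  n <- length(x)
  k <- floor(n * trim)               # observations dropped from EACH end
  kept <- sort(x)[(k + 1):(n - k)]   # middle part of the sorted data
  mean(kept)
}

# Made-up run times in seconds (hypothetical, 20 samples, two slow runs)
runs <- c(1.02, 0.98, 1.01, 0.99, 1.03, 1.00, 0.97, 1.04, 1.01, 0.99,
          1.02, 1.00, 0.98, 1.01, 1.75, 1.00, 0.99, 1.03, 2.40, 1.01)

mean(runs)                  # ordinary mean, pulled upward by the slow runs
trimmed_mean(runs, 0.10)    # drops the 2 smallest and 2 largest values
mean(runs, trim = 0.10)     # built-in trimmed mean, for comparison
```

With only 20 samples, a 5% trim removes just one value from each end, so a 10% or 20% trim may be more useful for sets of that size.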

Here is an example using 500 values sampled from an exponential distribution with mean 20 (rate 0.05). [Computations in R statistical software.]

x = rexp(500, 1/20)      # 500 exponential observations, rate 1/20 (mean 20)
mean(x);  sd(x)
## 21.10404      # ordinary sample mean
## 20.73966      # sample standard deviation
mean(x, trim = 0.05)
## 18.8269       # 5% trimmed mean
0.05*500
## 25            # number of observations trimmed from each end
x1 = (sort(x))[26:475]  # remove smallest and largest 25
mean(x1)
## 18.8269       # 5% trimmed mean again
max(x1)
## 58.83594      # largest observation in 5% trimmed data

Here are boxplots of the original and trimmed data. Because the same number of observations is trimmed from each end, the median is the same before and after trimming, but trimming has reduced the mean. The median, 14.8, is represented by the horizontal bar inside the box of each boxplot.

[Figure: boxplots of the original data and the 5% trimmed data]

median(x);  median(x1)
## 14.80144   # median of original data
## 14.80144   # median of 5% trimmed data

sort(x)[250:251]
## 14.76019 14.84269  # middle two observations of sorted data
mean(sort(x)[250:251])
## 14.80144           # halfway between them

If you want more severe trimming, you could use 10% or 20% trimming. If you use 50% trimming, you are left with just the sample median.
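To see how the amount of trimming changes the result, the following sketch computes the trimmed mean at several trimming levels for a fresh exponential sample. The seed is an arbitrary choice made only for reproducibility, so the numbers will differ from those shown earlier.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
x <- rexp(500, rate = 1/20)        # same kind of sample as in the example

trims <- c(0, 0.05, 0.10, 0.20, 0.50)
tm <- sapply(trims, function(tr) mean(x, trim = tr))
names(tm) <- paste0(100 * trims, "%")
tm                                 # trimmed means, labelled by trimming level

# With trim = 0.5, R's mean() returns the sample median
all.equal(unname(tm["50%"]), median(x))
```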