Outlier-resistant average of a set?


I have a web application in which users can provide valuations for items. These valuations are currently unrestricted, and there is no non-arbitrary criterion by which I could restrict them, since valuations are purely subjective. For each item, I have an "average valuation", which is currently just a simple mean of all valuations for the item. This was a naive choice that has inevitably led to abuse, whereby a small subset of users intentionally provides extreme valuations to manipulate this "average valuation" to their benefit.

I'd like to solve this problem mathematically, without imposing arbitrary restrictions on submitted valuations. As far as I can tell, it's very common for item valuations to follow a normal distribution, but I admittedly don't remember much of my statistics from decades ago. I have been trying to brush up on standard deviations, but it seems I can't simply exclude valuations more than 2σ from the mean, because the extreme outliers themselves inflate the standard deviation (I could be misunderstanding this concept, though). So...

What statistical approach can I take to determine the "average valuation", while remaining resistant to this kind of abuse?

For reference, here is an example set of valuations for a particular item. The outliers here are clearly the [317, 318, 630, 630, 640, 6511] subset:

60, 63, 63, 63, 63, 63.5, 63.8, 63.8, 63.9, 63.9, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64.2, 64.5, 64.5, 64.5, 64.5, 64.5, 64.9, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66.5, 67, 67, 67, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 69, 69, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 71, 72, 74, 74, 75, 75, 75, 75, 78, 80, 85, 317, 318, 630, 630, 640, 6511

For this set of valuations, I would expect an "average valuation" somewhere around 65 or 66.

3 Answers

BEST ANSWER

The median, as suggested by @Henry, seems the simplest solution to your problem. You might also consider a 'trimmed mean'. Below I show the mean, a 5% trimmed mean (the average of the middle 90% of your data, after discarding the top and bottom 5%), and the median. (In a sense, the median is a 50% trimmed mean.) You could try various degrees of trimming to see what works best in your situation.

I pasted your data into R statistical software with the following results.

 mean(x);  mean(x, trim=0.05);  median(x)
 ## 107.3667   # ordinary mean
 ## 66.06368   # 5% trimmed mean
 ## 65         # median

Both the trimmed mean and the median would also provide protection if there were users who purposely gave absurdly low values.

Another approach would be to use a 'boxplot' to detect 'outliers', ignore the outliers, and average the rest. The default outlier-detection method may include as outliers some values you would want to keep. (You could adjust the outlier rule to be less aggressive.)

The procedure below ignores the possibility of low outliers, and only omits the high ones. It has some potential of giving a result biased on the low side.

 boxplot.stats(x)$out
 ##   75   75   75   75   78   80   85  317  318  630  630  640 6511
 mean(x[x <= boxplot.stats(x)$stats[5]])  # effect is to avg values 74 or less
 ## 65.77665

My guess is that you want something simple and automatic. I would probably use the ordinary mean, 5% trimmed mean, and median in tandem for a while and then pick one of the latter two, depending on track record.

ANSWER

In statistics, an estimator suffers from three major sources of error: bias, variance, and contamination. The sample mean is a consistent estimator, but its variance and lack of robustness are undesirable in many scenarios, so historically many attempts have been made to reduce the overall error of mean estimation: for example, the trimmed mean, the Winsorized mean, the Hodges-Lehmann estimator, the Huber M-estimator, and the median of means.

Comparing the performance of these estimators is still a hot topic in statistics research. Recent advances suggest that the bias bound of the Winsorized mean is better than that of the trimmed mean (Mariusz Bieniek (2016), "Comparison of the bias of trimmed and Winsorized means", Communications in Statistics - Theory and Methods, 45:22, 6641-6650, DOI: 10.1080/03610926.2014.963620).
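For concreteness, here is a short Python sketch of a Winsorized mean (the function and its parameters are my own illustration, not from the paper). Unlike trimming, the extreme values are not discarded but clamped to the nearest retained order statistic, so they still count, just at a capped magnitude:

```python
import statistics

def winsorized_mean(xs, p=0.05):
    """Clamp the lowest and highest p-fraction of values to the
    nearest retained order statistic, then take the ordinary mean."""
    xs = sorted(xs)
    k = int(len(xs) * p)  # number of values to clamp at each end
    if k:
        xs = [xs[k]] * k + xs[k:len(xs) - k] + [xs[-k - 1]] * k
    return statistics.mean(xs)
```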

Also, the concentration bound of the median of means approaches the optimum for sub-Gaussian mean estimators (Luc Devroye, Matthieu Lerasle, Gabor Lugosi, Roberto I. Oliveira, "Sub-Gaussian mean estimators", Ann. Statist. 44(6): 2695-2725, December 2016, https://doi.org/10.1214/16-AOS1440).
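The median of means is also simple to implement: split the data into k blocks, average each block, and take the median of the block means; a few wild values can corrupt at most a few block means, which the median then ignores. A stdlib-only Python sketch (function name and default k are mine):

```python
import random
import statistics

def median_of_means(xs, k=5):
    """Median of the k block means of a randomly partitioned sample."""
    xs = list(xs)
    random.shuffle(xs)  # guard against adversarial ordering
    blocks = [xs[i::k] for i in range(k)]
    return statistics.median(statistics.mean(b) for b in blocks)
```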

Statistics is undergoing a shift from nonparametrics to semiparametrics; because this research is recent, it is not yet emphasized in current textbooks.

In my paper, I define two new classes of semiparametric distributions, introduce several new mean estimators, and further explain why the Winsorized mean is better than the trimmed mean in most cases. I also propose the median Hodges-Lehmann mean as an optimal nonparametric robust mean estimator. If you are interested, you can watch my YouTube videos: https://www.youtube.com/playlist?list=PLv12WMZUyCNCxgQdS8wguSWs60uKttHaM
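For reference, the classical one-sample Hodges-Lehmann estimator mentioned above is the median of all pairwise (Walsh) averages; a minimal Python sketch of that classical version (not of the variant proposed in the paper):

```python
import statistics
from itertools import combinations_with_replacement

def hodges_lehmann(xs):
    """Median of all Walsh averages (x_i + x_j) / 2 for i <= j.
    This enumerates O(n^2) pairs, so it only suits modest n."""
    walsh = [(a + b) / 2 for a, b in combinations_with_replacement(xs, 2)]
    return statistics.median(walsh)
```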

ANSWER

In my case, the outliers are important and should influence the end result slightly, so here's a method that I decided to use:

    // Iteratively reweighted mean: each value is weighted by the survival
    // function of an exponential distribution whose mean equals the current
    // mean absolute deviation, so distant outliers get an exponentially
    // small -- but nonzero -- weight.
    static double OutlierResistantMean(List<double> items)
    {
        var a = items.Average();                       // initial estimate: plain mean
        var eps = (items.Max() - items.Min()) * 1e-5;  // convergence tolerance
        while (eps > 0)                                // eps == 0 only if all values are equal
        {
            // Exponential with rate 1 / (mean absolute deviation from current estimate)
            Exponential e = new Exponential(1 / items.Average(i => Math.Abs(i - a)));

            double s1 = 0, s2 = 0;

            foreach (var i in items)
            {
                // Weight = survival function = P(X > |i - a|)
                var d = 1 - e.CumulativeDistribution(Math.Abs(i - a));
                s1 += i * d;
                s2 += d;
            }

            var na = s1 / s2;                          // new weighted mean
            if (Math.Abs(na - a) < eps)
            {
                break;                                 // converged
            }
            a = na;
        }

        return a;
    }

Here, Exponential is the exponential distribution class from the MathNet.Numerics package.

In your case, the function returns 66.0967586215405
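For readers not using .NET, here is my stdlib-only Python translation of the same iteration; the closed-form survival function exp(-x/m) of an exponential with mean m stands in for `1 - e.CumulativeDistribution(x)`:

```python
import math

def outlier_resistant_mean(items, tol_factor=1e-5):
    """Iteratively reweighted mean: each value is weighted by
    exp(-|x - a| / m), where m is the current mean absolute deviation,
    so distant outliers retain a small but nonzero influence."""
    a = sum(items) / len(items)                    # start from the plain mean
    eps = (max(items) - min(items)) * tol_factor   # convergence tolerance
    if eps == 0:
        return a                                   # all values identical
    while True:
        m = sum(abs(x - a) for x in items) / len(items)
        if m == 0:
            return a
        weights = [math.exp(-abs(x - a) / m) for x in items]
        na = sum(x * w for x, w in zip(items, weights)) / sum(weights)
        if abs(na - a) < eps:
            return na
        a = na
```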

I suppose that if you believe the outliers are errors, you could try using a normal distribution instead.