Maximise difference between mean and median

3.1k Views Asked by At

I have a set of numbers and I want to maximise the difference between the mean and the median by removing a given number of elements.

For instance I have the following set:

0 5 10 45
Mean = 15
Median = 7.5
Mean - Median = 7.5

If we choose to remove 1 element I want to obtain the following

0 5 45
Mean = 16.66
Median = 5
Mean - Median = 11.66 => MAXIMISED

We can assume that the set is sorted.

Is there an algorithm for doing this?

2

There are 2 best solutions below

0
On

One hint of how to fast compute is that you can precompute the sum of all $N$ numbers, let us say it is $S$, then the mean value after removing the values $x_1,\cdots, x_k$ would be $$\frac{1}{N-k}\left({S - \sum_{1}^k x_i}\right)$$ if you want to remove only a small part of the elements, then computing the above will be faster than re-computing the whole sum each time for each new subset of $x_i$. So if we have $N = 100$ elements and remove $k = 3$ for each configuration of $x_i$, we only need to do 3 subtractions instead of 97 additions.


For the index of the median, we can do an iterative approach. First calculate the first median. Then we take the index of the first number to remove, if the index of the number to remove is lower than the index of the current median, then the index for the new median increases by $1/2$ ( why? ) and if it is larger it is reduced by $1/2$. If it is the same then no change. Then we need some special handling to check that the final index will not accidentally be the same as one of the removed numbers. If we are into signal processing, we can do this with some kind of adaptive filtering, for instance local normalized averaging where we set the certainty of any removed element to 0 and for any not removed to 1.

0
On

This is NOT a complete answer, but I try to sketch my idea.

Suppose we have a sorted set, say, in an increasing order. If we want to maximise the difference between mean and median, it is necessary (but not sufficient) to keep as many outliers as possible (because mean is very much affected by these outliers, whereas median is not). The problems here are: how many is "many"? and how to control the difference?

Suppose $x_1 < x_2 < ... < x_n$. We can calculate its mean and median. My alogorithm (which I've not yet been able to prove so far) is as follows:

  1. If mean $>$ meadian: we delete the median or the closest number which is BIGGER than median, that is $x_{(n+1)/2}$, if $n $ is odd or $x_{n/2+1}$, if $n$ is even. The effect of doing this is: after each step, the mean (not always) gets bigger and bigger but the median (not always) gets smaller and smaller (and of course, mean is always bigger than median in every step). But overall, the difference gets bigger.
  2. If mean $<$ median: we delete the median or the closest number which is SMALLER than median, that is $x_{(n+1)/2}$, if $n $ is odd or $x_{n/2-1}$, if $n$ is even. The effect of doing this is: after each step, the mean (not always) gets smaller and smaller but the median (not always) gets bigger and bigger (and of course, mean is always smaller than median in every step). But overall, the difference gets bigger.

It turns out that, finally we will only have $3$ numbers left (can't be $2$, otherwise mean $=$ median): the biggest, the smallest, and the second smallest (if mean $>$ median) or the second biggest (if mean $<$ median). These $3$ numbers will have the maximum difference between mean and median.