Formulation and computation of "the" unique median of an even-sized list

Consider an even-sized set of numbers $X = \{x_k\}$, such as $X = \{1, 2, 7, 10\}$.
The median $m$ is defined as:

$$m = \operatorname*{arg\,min}_x \sum_k \lvert x_k - x\rvert^1$$

Any $m \in [2, 7]$ is a minimizer of this function, and is therefore "a" median of this list.
Now, it is common practice to take the average of 2 and 7 and call it "the" median.

But that's lame, and I think I have invented (?) a more logical way to find a unique median $m^*$:

$$m^* = \lim_{\epsilon \to 0^+} \operatorname*{arg\,min}_x \sum_k \left\lvert x_k - x\right\rvert^{1+\epsilon}$$

Differentiation to find the minimum only gets us so far:

$$\sum_k \mathrm{sgn}{\left(x_k - m^*\right)}\left\lvert x_k - m^*\right\rvert^{\epsilon} = 0$$

This equation can be solved numerically for smaller and smaller $\epsilon$ to give $m^* \approx 4.85$ in this example, and I suspect the "correct" median is in fact $m^* = 34/7$, but I don't know how to prove it.
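For anyone who wants to reproduce the numbers, here is a minimal sketch of that numerical procedure (the function name is my own). It uses the fact that the left-hand side above is strictly decreasing in $m^*$, so plain bisection finds the root:

```python
import math

def smoothed_median(xs, eps, tol=1e-12):
    """Root of sum_k sgn(x_k - m) * |x_k - m|**eps = 0 by bisection.

    The left-hand side is strictly decreasing in m, so it crosses
    zero exactly once between min(xs) and max(xs).
    """
    lo, hi = min(xs), max(xs)

    def g(m):
        return sum(math.copysign(abs(x - m) ** eps, x - m) for x in xs)

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:      # still positive: the root lies to the right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

X = [1, 2, 7, 10]
for eps in (1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6):
    # the minimizers drift toward the conjectured limit 34/7 ≈ 4.857
    print(eps, smoothed_median(X, eps))
```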

I have 3 questions:

  1. First of all, is this a well-known and/or useful approach? Does it have a name?
    I came up with the new formulation myself, but I've never seen it used anywhere.

  2. Is there some way to directly find the exact value of $m^*$, without numerical optimization?
    If not, is there a better/faster approach than brute-force numerical optimization techniques?

  3. Is this a (convex?) optimization problem, and if not, can it be reformulated as one?
    The trouble here is that I can't find any objective function that has a unique minimum at $m^*$.
    The best I can do is to find a generalized function (i.e., the limit of another function), but when I do that, I don't think the problem is a convex optimization problem anymore.
    Is there another way to pose the problem that conforms better to existing optimization frameworks?

There are 2 best solutions below

BEST ANSWER

If the values are sorted $x_1 \le x_2 \le \cdots \le x_n$, then the value $m^*$ is the unique solution on the interval $[x_{n/2},x_{n/2+1}]$ to the following equation:

$$ (m^* - x_1)(m^*-x_2)\ldots(m^*-x_{n/2})=(x_{n/2+1}-m^*)(x_{n/2+2}-m^*)\ldots(x_{n}-m^*). $$

When $n=2$, it's just the mean, and when $n=4$, $m^* = (x_3x_4-x_1x_2)/(x_3+x_4-x_1-x_2)$, which in your example is $(7\cdot 10-1\cdot2)/(7+10-1-2)=34/7$. I don't see any simple way to solve the equation in closed form for higher $n$, other than standard techniques for finding roots of polynomials.
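A short numerical sketch of that root-finding (the function name is my own): on $[x_{n/2}, x_{n/2+1}]$ the left product is nonnegative and increasing while the right product is nonnegative and decreasing, so their difference changes sign exactly once and bisection suffices.

```python
def generalized_median(xs, tol=1e-12):
    """Bisection for the unique root of
    (m - x_1)...(m - x_{n/2}) - (x_{n/2+1} - m)...(x_n - m)
    on [x_{n/2}, x_{n/2+1}], for an even-sized list xs."""
    xs = sorted(xs)
    n = len(xs)
    lo, hi = xs[n // 2 - 1], xs[n // 2]
    if lo == hi:                 # middle values coincide: that's the median
        return lo

    def f(m):
        left = right = 1.0
        for x in xs[: n // 2]:
            left *= m - x        # increasing in m on [lo, hi]
        for x in xs[n // 2:]:
            right *= x - m       # decreasing in m on [lo, hi]
        return left - right      # strictly increasing on [lo, hi]

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(generalized_median([1, 2, 7, 10]))  # ≈ 34/7
```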

To prove that the above equation defines $m^*$, you just need to go a few steps further in the manipulation of the derivative. That is, $m^* = \lim_{\epsilon\rightarrow 0^+} m_\epsilon$, where $m_\epsilon$ is the solution to:

$$\sum_{k=1}^n \mathrm{sgn}{\left(x_k - m_\epsilon\right)}\left\lvert x_k - m_\epsilon\right\rvert^{\epsilon} = 0$$

For small $\epsilon$, we should have $x_{n/2}\le m_\epsilon \le x_{n/2+1}$, so this becomes:

$$-\sum_{k=1}^{n/2} (m_\epsilon-x_k)^{\epsilon} + \sum_{k=n/2+1}^{n} (x_k-m_\epsilon)^{\epsilon}=0$$

Expanding to first order in $\epsilon$ gives: $$-\sum_{k=1}^{n/2} (1+\epsilon \log(m_\epsilon-x_k)) + \sum_{k=n/2+1}^{n} (1+\epsilon\log(x_k-m_\epsilon))=O(\epsilon^2)$$

The constant terms cancel, and dividing by $\epsilon$ gives: $$-\sum_{k=1}^{n/2} \log(m_\epsilon-x_k) + \sum_{k=n/2+1}^{n} \log(x_k-m_\epsilon)=O(\epsilon)$$

Then by taking the limit as $\epsilon \rightarrow 0^+$, we get: $$-\sum_{k=1}^{n/2} \log(m^*-x_k) + \sum_{k=n/2+1}^{n} \log(x_k-m^*)=0,$$ which, after exponentiating both sides, is equivalent to the stated condition.
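As a sanity check against the example from the question, the product condition does balance exactly at $m^* = 34/7$:

$$\left(\frac{34}{7}-1\right)\left(\frac{34}{7}-2\right) = \frac{27}{7}\cdot\frac{20}{7} = \frac{540}{49} = \frac{15}{7}\cdot\frac{36}{7} = \left(7-\frac{34}{7}\right)\left(10-\frac{34}{7}\right).$$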

There are different ways to approximate the 1-norm with a differentiable function, and each approximation will give a different unique "median". I don't know of any reason to prefer any one approximation over another other than convenience.
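To make that last point concrete, here is a small comparison (helper names and parameter values are my own) of two smooth surrogates for the absolute value: the $|t|^{1+\epsilon}$ family from the question, and the common rounding $\sqrt{t^2+\delta^2}$. For the example set their minimizers settle near two different values (roughly $34/7 \approx 4.857$ versus about $4.66$):

```python
import math

X = [1, 2, 7, 10]

def root_increasing(grad, lo, hi, tol=1e-12):
    """Bisection for the root of a strictly increasing function."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if grad(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

EPS = 1e-6    # exponent perturbation in |t|^(1 + EPS)
DELTA = 1e-3  # rounding parameter in sqrt(t^2 + DELTA^2)

def grad_power(m):
    # derivative of sum |m - x_k|^(1+EPS), up to the positive factor (1+EPS)
    return sum(math.copysign(abs(m - x) ** EPS, m - x) for x in X)

def grad_sqrt(m):
    # derivative of sum sqrt((m - x_k)^2 + DELTA^2)
    return sum((m - x) / math.sqrt((m - x) ** 2 + DELTA ** 2) for x in X)

m_pow = root_increasing(grad_power, min(X), max(X))
m_sqrt = root_increasing(grad_sqrt, min(X), max(X))
print(m_pow, m_sqrt)  # two different "unique medians" from two surrogates
```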

ANSWER

Unfortunately, this approach is not compatible with convex optimization in practice.

The reason is that in an optimization context, a convex function and its epigraph are assumed interchangeable. That is to say: consider the following two problems: $$\begin{array}{ll} \text{minimize} & f(x) \end{array}$$ $$\begin{array}{ll} \text{minimize} & y \\ \text{subject to} & f(x) \leq y \end{array}$$ These problems are equivalent if $f$ is convex: that is, given the solution to one, the solution to the other is evident, and vice-versa. Of course, the second one has an associated dual variable while the first one does not, but that doesn't change the equivalence.

Now let's consider your function for $f$, set in the second form above: $$\begin{array}{ll} \text{minimize} & y \\ \text{subject to} & \lim_{\epsilon\rightarrow 0^+} \sum_k | x_k - x |^{1+\epsilon} \leq y \end{array}$$ This is, in all practical respects, equivalent to $$\begin{array}{ll} \text{minimize} & y \\ \text{subject to} & \sum_k | x_k - x | \leq y \end{array}$$ which is of course what you'd get with the standard median function. Any practical system for optimization is really not going to be able to differentiate between the two forms. You could, of course, fix $\epsilon$ to be small and nonzero, but then you've destroyed equivalence, and of course made the leap from a linear problem to a nonlinear one.
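For concreteness, here is a sketch of the epigraph form of the standard median problem as an LP, using `scipy.optimize.linprog` (the variable layout is my own choice). The solver returns *some* point of the optimal face $[2, 7]$, with no mechanism to single out $34/7$:

```python
import numpy as np
from scipy.optimize import linprog

# Epigraph form of the median problem:
#   minimize sum_k t_k  subject to  t_k >= x - x_k  and  t_k >= x_k - x.
# Variables are stacked as [x, t_1, ..., t_n].
X = [1, 2, 7, 10]
n = len(X)

c = np.concatenate(([0.0], np.ones(n)))  # objective: sum of the t_k

A_ub = np.zeros((2 * n, n + 1))
b_ub = np.zeros(2 * n)
for k, xk in enumerate(X):
    A_ub[2 * k, 0], A_ub[2 * k, 1 + k] = 1.0, -1.0       #  x - t_k <= x_k
    b_ub[2 * k] = xk
    A_ub[2 * k + 1, 0], A_ub[2 * k + 1, 1 + k] = -1.0, -1.0  # -x - t_k <= -x_k
    b_ub[2 * k + 1] = -xk

res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] + [(0, None)] * n)
# res.x[0] is some minimizer in [2, 7]; the objective is 14 on the whole
# face, and which point you get is solver-dependent.
print(res.x[0], res.fun)
```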

Conceptually, what is happening here is that you are preferring a particular element of the $\arg\min$ set over the rest. But establishing preferences among feasible points is precisely the purpose of an objective function. You need to find a way to integrate your preferences more directly into your objective or constraints. For instance, if you are determined to preserve the numerical results induced by the $|\cdot|^{1+\epsilon}$ approach (in particular, if it is important that $34/7$ be the correct answer in this example), then you will not be able to use this median function in convex optimization.