Inversed Cross Median of two arrays

Question


44 Views Asked by Bumbble Comm At 10 May 2026 - 7:38

The problem I am facing is as follows: given two arrays $A$ and $B,$ I would like to find a threshold $t$ such that the number of elements of $A$ that are less than $t$ equals the number of elements of $B$ that are greater than $t.$ I am pretty sure that the solution to my problem is the median of some array formed from $A$ and $B,$ but I don't see exactly how.

More generally, we are looking for a threshold that minimizes the absolute difference between the number of elements of $A$ below it and the number of elements of $B$ above it.
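To make the quantity being minimized concrete, here is a minimal sketch of the two counts and their difference. The helper name `count_gap` and the sample arrays are only illustrative assumptions, not part of the question.

```python
def count_gap(a, b, t):
    """Return (# elements of a below t) minus (# elements of b above t)."""
    below_a = sum(1 for x in a if x < t)
    above_b = sum(1 for x in b if x > t)
    return below_a - above_b


# Illustrative data (an assumption, chosen so that an exact threshold exists).
a = [1, 4, 6, 9]
b = [2, 3, 7, 8]

for t in (0, 2.5, 5, 10):
    print(t, count_gap(a, b, t))
```

A threshold solves the exact problem when `count_gap` returns 0; the generalized problem asks for the `t` whose `count_gap` is closest to 0.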

Thanks in advance for your replies.

Original Q&A

There are 3 best solutions below

Bumbble Comm On 25 Jun 2019 - 4:16

Well, the median of the concatenated arrays is certainly not the answer, nor is the median of the union of the arrays the answer, even if the arrays have the same length. The following simple Python code will disabuse you of that notion:

from random import randint
from statistics import median
import numpy as np

size_a = randint(10, 20)
a = [randint(0, 10) for __ in range(size_a)]
a.sort()

size_b = randint(10, 20)
b = [randint(0, 10) for __ in range(size_b)]
b.sort()

c = list(set(a + b))  # This is testing the union. For concat, eliminate
                      # the list and set calls and just do c = a + b.
c.sort()

print('a = ' + str(a))
print('b = ' + str(b))
print('median of c = a union b is ' + str(median(c)))

a = np.array(a)  # Useful for conditional indexing.
b = np.array(b)

print('Number of elements of a less than median: ')
print(str(len(a[a < median(c)])))

print('Number of elements of b greater than median: ')
print(str(len(b[b > median(c)])))

However, I think we can take this basic idea and tweak it a little to solve your problem. Idea: take the concatenated array c = a + b, sort it, and start with the median of the result. Then compare the two counts. If the number of elements of $a$ below the threshold is larger than the number of elements of $b$ above it, go earlier in $c$. Otherwise, go later.

However, I would like to point out that not every choice of $A$ and $B$ will yield a solution. For example: \begin{align*} A&=[0, 2, 2, 3, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10] \\ B&=[0, 0, 1, 1, 1, 1, 2, 3, 3, 7, 8] \end{align*} does not yield a solution. If you take $0, 0.5, 1, 1.5, 2, 2.5,$ and $3$ in turn, you will see that for none of these thresholds does the number of elements of $A$ less than the threshold equal the number of elements of $B$ greater than it. Moreover, as you traverse that list, the comparison reverses: at $2.5$ there are $3$ elements of $A$ below and $4$ elements of $B$ above, while at $3$ there are $3$ below and $2$ above, so the two counts cross without ever being equal.

The following Python code will find a threshold, if it exists, and otherwise an optimal threshold minimizing the difference between the number of elements of $A$ less than the threshold and the number of elements of $B$ greater than the threshold:

from random import randint
from statistics import median
import numpy as np

size_a = randint(10, 20)
a = [randint(0, 10) for __ in range(size_a)]
a.sort()

size_b = randint(10, 20)
b = [randint(0, 10) for __ in range(size_b)]
b.sort()

c = a + b
c.sort()

print('a = ' + str(a))
print('b = ' + str(b))
print('median of c = a concat b is ' + str(median(c)))

a = np.array(a)
b = np.array(b)
c = np.array(c)

t = median(c)
t_index = len(c[c < t])
iter_count = 0
diff = abs(len(a[a < t]) - len(b[b > t]))
best_t = t

while 0 < diff and iter_count < len(c):

    iter_count += 1

    if len(a[a < t]) > len(b[b > t]):
        # Clamp the start index: a negative start would give an empty slice (mean = nan).
        t = np.mean(c[max(t_index - 1, 0):t_index + 1])
        if len(a[a < t]) > len(b[b > t]):
            t_index -= 1
            t = c[t_index]
    else:
        t = np.mean(c[t_index:t_index+2])
        if len(a[a < t]) < len(b[b > t]):
            t_index += 1
            t = c[t_index]

    if abs(len(a[a < t]) - len(b[b > t])) < diff:
        diff = abs(len(a[a < t]) - len(b[b > t]))
        best_t = t

if 0 < diff:
    print('Could not find an exact threshold.')
    print('Optimal threshold was ' + str(best_t))
    print('Difference in set cardinalities was ' + str(diff))
else:
    print('Threshold is ' + str(best_t))

Bumbble Comm On 26 Jun 2019 - 8:14

This partial answer thinks about the problem more theoretically. It's also very hand-wavy and not rigorous mathematics. This is to brainstorm. Let \begin{align*} A_{<t}&:=\{a\in A:a<t\},\\ A_{=t}&:=\{a\in A:a=t\},\\ B_{<t}&:=\{b\in B:b<t\},\\ B_{>t}&:=\{b\in B:b>t\},\;\text{and}\\ B_{=t}&:=\{b\in B:b=t\}. \end{align*} We use the notation $|C|$ to denote the cardinality of set $C.$ Note that $|B_{>t}|=|B|-|B_{<t}|-|B_{=t}|,$ assuming nothing in sight is infinite.

Our goal is to find the $t$ that minimizes $\big| |A_{<t}|-|B_{>t}| \big|,$ or $$\min_{t}\sqrt{(|A_{<t}|-(|B|-|B_{<t}|-|B_{=t}|))^2}. $$ But the $t$ that minimizes this expression also minimizes the expression without the square root: $$\min_{t}\,(|A_{<t}|+|B_{<t}|+|B_{=t}|-|B|)^2. $$

Let us assume we can differentiate this expression with respect to $t$ as follows: \begin{align*} \frac{d}{dt}\,(|A_{<t}|+|B_{<t}|+|B_{=t}|-|B|)^2&=2(|A_{<t}|+|B_{<t}|+|B_{=t}|-|B|)\,\frac{d}{dt}\,\left(|A_{<t}|+|B_{<t}|+|B_{=t}|-|B|\right). \end{align*}

Now here is where we get some help from statistics. We can interpret both $|A_{<t}|$ and $|B_{<t}|$ as (un-normalized) cumulative distribution functions. We know that the "derivative" of a cumulative distribution function is a probability density function (you can think of that as a counting function). So we can simplify the second factor as follows: $$\frac{d}{dt}\,\left(|A_{<t}|+|B_{<t}|+|B_{=t}|-|B|\right)=|A_{=t}|+|B_{=t}|+\frac{d}{dt}\,|B_{=t}|. $$ The first two terms here are non-negative, so for this expression to be zero we need $$\frac{d}{dt}\,|B_{=t}|\le 0, $$ with strict inequality whenever $t$ coincides with an element of $A$ or $B.$ So the counting function $|B_{=t}|$ would need to be decreasing there. Either that, or the first factor vanishes, $|A_{<t}|+|B_{<t}|+|B_{=t}|-|B|=0,$ which would clearly be the global minimum.
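As a complement to the derivative heuristic, here is a discrete observation that needs no differentiation. It is only a sketch, treating $A$ and $B$ as finite multisets of real numbers; the names $g$ and $B_{\le t}$ are introduced here for illustration. Write $$g(t):=|A_{<t}|-|B_{>t}|=|A_{<t}|+|B_{\le t}|-|B|,\qquad B_{\le t}:=\{b\in B:b\le t\}. $$ For $s<t$ we have $A_{<s}\subseteq A_{<t}$ and $B_{\le s}\subseteq B_{\le t},$ hence \begin{align*} g(s)&\le g(t),\\ g(t)&=-|B|\quad\text{for } t<\min(A\cup B),\\ g(t)&=|A|\quad\text{for } t>\max(A\cup B). \end{align*} So $g$ is a non-decreasing step function that climbs from $-|B|$ to $|A|$ and changes value only at elements of $A$ or $B.$ Consequently $|g|$ is minimized where $g$ changes sign, and an exact threshold exists precisely when $g$ takes the value $0$ rather than jumping over it.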

**Bumbble Comm** · Accepted Answer

Hehe. I'm adding a lot of answers, but I think each answer here has value, and represents a different approach. Here's an approach that uses a sort of cumulative distribution function for $A$, and a "de-cumulative" distribution function for $B$. The idea is to find the smallest point in either array and the biggest point in either array, construct an evenly spaced grid of values covering that range, evaluate both functions on the grid, take the difference, and then find the minimum of the absolute value of the difference. Here's the Python code:

from random import randint
import numpy as np

size_a = randint(10, 20)
a = [randint(0, 10) for __ in range(size_a)]
a.sort()

size_b = randint(10, 20)
b = [randint(0, 10) for __ in range(size_b)]
b.sort()

a = np.array(a)
b = np.array(b)

min_ab = np.floor(min([min(a), min(b)])) - 1
max_ab = np.ceil(max([max(a), max(b)])) + 1

print(str(a))
print(str(b))

print(str(min_ab))
print(str(max_ab))

grid = np.arange(min_ab, max_ab + 1, 1)
print(str(grid))

a_cdf = np.array([len(a[a < t]) for t in grid])
b_ddf = np.array([len(b[b > t]) for t in grid])

diff = np.abs(a_cdf - b_ddf)

print('The smallest difference is ' + str(np.min(diff)))
print('The best thresholds are ' + str(grid[diff == np.min(diff)]))

print(str(a_cdf))
print(str(b_ddf))
print(str(diff))

This code only uses implied loops, such as the comprehensions that build the a_cdf and b_ddf variables. The downside of this approach is that you don't necessarily know whether your grid is fine enough. Right now, I just have it running on integers. You might find that with a fine enough grid, you can find solutions more readily. You can easily change the grid coarseness by changing the np.arange(min_ab, max_ab + 1, 1) call: make the last argument smaller.
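One way to sidestep the question of grid fineness is to build the grid from the data itself. The two counts can only change at values that occur in $A$ or $B$, so it should be enough to test every distinct value, the midpoint between each pair of neighbouring distinct values, and one point beyond each end. The sketch below assumes non-empty numeric arrays; the helper name `exact_grid` is hypothetical, and `np.searchsorted` is swapped in for the boolean-mask counts used above.

```python
import numpy as np


def exact_grid(a, b):
    """Candidate thresholds: distinct values, midpoints, one point past each end."""
    values = np.unique(np.concatenate([a, b])).astype(float)
    mids = (values[:-1] + values[1:]) / 2
    ends = [values[0] - 1, values[-1] + 1]
    return np.sort(np.concatenate([values, mids, ends]))


a = np.array([0, 2, 2, 3, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10])
b = np.array([0, 0, 1, 1, 1, 1, 2, 3, 3, 7, 8])

grid = exact_grid(a, b)

# Number of elements of a strictly below each grid point.
a_cdf = np.searchsorted(np.sort(a), grid, side="left")
# Number of elements of b strictly above each grid point.
b_ddf = len(b) - np.searchsorted(np.sort(b), grid, side="right")

diff = np.abs(a_cdf - b_ddf)
best = grid[diff == diff.min()]

print("Smallest difference:", diff.min())
print("Thresholds achieving it:", best)
```

For these two arrays the smallest difference is 1, reached at 2.5 and 3.0. Any other threshold gives the same counts as one of the grid points, so no finer grid can do better.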
