Smallest sum of differences between the numbers in two datasets

43 Views Asked by At

I'm trying to assess the similarity between 2 datasets containing integer numbers. Both datasets are of equal length (size). For this, I am first removing all the numbers which appear in both datasets (one-for-one).

Example:

$Dataset_1 = 0, 1, 2, 2, 3, 5, 7$

$Dataset_2 = 1, 2, 2, 2, 4, 6, 8$

After removing the matching number pairs:

$Dataset_1 = 0, 3, 5, 7$

$Dataset_2 = 2, 4, 6, 8$

Then, I am left with 2 datasets of non-matching numbers (i.e. don't appear in the opposite dataset). What I would like to do is for each number pick a number in the opposite dataset, find the absolute difference between these numbers ($|n_1 - n_2|$), and then sum up all the differences together to assess the similarity between both datasets. The problem is, which pairs do I pick to obtain the smallest result?

If you pick pairs $[0, 2], [3, 4], [5, 6]$ and $[7, 8]$ the sum of their differences would be:

$2 + 1 + 1 + 1 = 5$ (Correct)

However, if you pick pairs $[2, 3], [4, 5], [6, 7]$ and $[8, 0]$ you get:

$1 + 1 + 1 + 8 = 11$ (Incorrect)

Note: Please do not post answers involving other methods to assess the similarity. I'd like a solution to this pair-matching problem.

1

There are 1 best solutions below

1
On BEST ANSWER

You can do no better than sorting both lists and matching them up in order. To prove this, suppose you have a matching that violates this ordering and show that you can improve it.

Explicitly, suppose $i_1<j_1$ in dataset 1 and $i_2<j_2$ in dataset 2, and the current matches are $(i_1,j_2)$ and $(j_1,i_2)$, with cost $|i_1-j_2|+|j_1-i_2|$. Use the triangle inequality to show that matching $(i_1,i_2)$ and $(j_1,j_2)$ instead would yield a smaller cost.