I'm trying to assess the similarity between 2 datasets containing integer numbers. Both datasets are of equal length (size). For this, I am first removing all the numbers which appear in both datasets (one-for-one).
Example:
$Dataset_1 = 0, 1, 2, 2, 3, 5, 7$
$Dataset_2 = 1, 2, 2, 2, 4, 6, 8$
After removing the matching number pairs:
$Dataset_1 = 0, 3, 5, 7$
$Dataset_2 = 2, 4, 6, 8$
Then, I am left with 2 datasets of non-matching numbers (i.e. don't appear in the opposite dataset). What I would like to do is for each number pick a number in the opposite dataset, find the absolute difference between these numbers ($|n_1 - n_2|$), and then sum up all the differences together to assess the similarity between both datasets. The problem is, which pairs do I pick to obtain the smallest result?
If you pick pairs $[0, 2], [3, 4], [5, 6]$ and $[7, 8]$ the sum of their differences would be:
$2 + 1 + 1 + 1 = 5$ (Correct)
However, if you pick pairs $[2, 3], [4, 5], [6, 7]$ and $[8, 0]$ you get:
$1 + 1 + 1 + 8 = 11$ (Incorrect)
Note: Please do not post answers involving other methods to assess the similarity. I'd like a solution to this pair-matching problem.
You can do no better than sorting both lists and matching them up in order. To prove this, suppose you have a matching that violates this ordering and show that you can improve it.
Explicitly, suppose $i_1<j_1$ in dataset 1 and $i_2<j_2$ in dataset 2, and the current matches are $(i_1,j_2)$ and $(j_1,i_2)$, with cost $|i_1-j_2|+|j_1-i_2|$. Use the triangle inequality to show that matching $(i_1,i_2)$ and $(j_1,j_2)$ instead would yield a smaller cost.