Algorithim to choose comparison pairs for topological sorting

189 Views Asked by At

I'm trying to find or create an algorithm to roughly sort arbitrary objects using pairwise comparison where the only concern is minimizing the number of comparisons. So my question is essentially is there any formula to determine which comparisons would yield more information about the correct order?

Note that I'm not well versed in math so I may be phrasing the question poorly, so I'll start with a picture:

Two disconnected graphs showing two lists of numbers in ascending order. First is 15, 20, 40, 82 Second is 2, 4, 6, 8, 10, 12, 14

Looking at the two lists above, we can see that comparing the head of $A$ to the tail of $B$ would give us a fully sorted list in one comparison. Comparing head to head would mean we have to go down the list making six comparisons. But if list $A$ had lower values, we would have to compare the opposite ends of each list to get the ideal merge. And the third case of course is that the lists are overlapping, so the potential cost in comparisons would be based on the length of each list.

In the real situation, merging would be between trees and not lists since the initial pairwise comparisons could result in nodes with multiple children before creating a run like the first example. For 12 elements paired randomly, then the winners compared we get this:

Three directed acyclic graphs showing the results of two rounds of pairwise comparisons

What to compare next? Comparing A and B puts A into the path where it belongs resulting in a sublist of 4 sorted items. Comparing C to D would give us a run of 6 sorted items if c < d, which seems like more information for one comparison. However, if d > c, then the graphs are connected but we haven't learned as much about the ordering.

Is there an algorithm to deal with this situation or a formula to determine which comparisons are best? (Where best means minimizing expected comparisons to fully merge the trees)

1

There are 1 best solutions below

0
On

As I predicted, I wasn't phrasing the question well (at least for a math context) so now that I've figured out how to deal with it, I'll break up the question into parts as I answer it.

1a. When merging two lists, which one comparison yields the most information?

This was essentially the main question and it's simple when you think about it. I don't know how to write it in proper information theory notation, so I'll just explain my logic. Binary insertion of one element eliminates half the list with every comparison, but with a list that doesn't help much because all elements of one list could belong between two elements of the second list. Below is a visual, showing on the bottom list how many possible comparisons remain to fully merge the lists if element $A_1$ is compared to that element. Where the numbers are if A_1 is < / if A_1 is >.

enter image description here

Easy to see that comparing $A_1$ to $B_n$ is best if you've only got one comparison.

There are actually proper algorithms for optimally merging two lists faster than a linear merge in $O(n+m)$ but they only work when $n/m$ is a fairly large number, something like $n$ being 8 times longer iirc. This is one, I lost the link for another paper that improves on it.

1a. What about when merging two trees?

There are algorithms for merging trees of course, but most of what I found was about more structured things like heaps. And naturally not about making only one or two comparisons. The answer for my case is again fairly simple. Just take the longest chain and treat it as a list like above. If you feel like doing more comparisons, there are various ways to improve the sorted result but they depend on your goal and I can't quantify them mathematically so I'll move on.

2a. What algorithm results in the best approximate sort for (some number) of comparisons?

In the original question by roughly sorted I meant "resulting in a generally increasing order" such that if values were graphed you could see a visible trend. Even with that, there are many ways to specifically define it: k-sorted, Spearman's rank correlation, etc. The way I phrased the question is similar to sorting with partial information, a complicated math problem without a simple solution.

A simpler question is "are you more interested in finding the top things quickly, or do you want everything evenly sorted". The former is the subject of partial sorting and selection algorithms, as well as many tournament systems (which are analogous to sorting without transitivity). For the latter, Algorithms for approximate sorting exist but those seem like overkill. There are statistical methods for ranking and preference learning but that's an entirely different approach from the start.

2b. A more accurate phrasing is How do I make sure no comparison is wasted?

This is where a graph is finally relevant. Tournament sort is a simple sorting algorithm that is functionally the same as Merge sort with the difference that instead of lists it uses trees. Each time an element "wins" it adds to a subtree of losers. After $N-1$ comparisons the first element is at the root of the tree and the rest of the tree should give a decent result when topologically sorted. Any additional comparisons can be used to refine the tree as desired. If modifying the algorithm to use the information from 1a, it may give better approximations in some cases but that's a question for some trial and error.

Apologies for the long post that's light on math. I thought I might as well take the time to write it all out in case it helps someone somewhere. Any mathematical corrections or additions are more than welcome.