Optimal permutation for differential storage

58 Views Asked by At

This is a practical problem I encountered recently. I'm convinced that it has been solved before, however I don't know where to start looking as I'm unfamiliar with all the terminology involved. Hope someone can point me in the right direction :)

Suppose you have n files of various sizes. You can use a computer program to obtain a diff-file, which contains the difference between two files. Depending on the similarity of file oA and oB, this diff dAB could be very small. The size of a diff is not affected by file order; dAB is the same size as dBA. However, the diffs are monodirectional: To reconstruct oA you would need oB and dBA, as dAB would be of no use.

You want to find the optimal permutation of diffs, such that by starting from just one of the original files, you can recreate all the other files. The size of all possible diff files are already known. For example, given files oA, oB, oC, oD and oE, the optimal permutation could be dAB, dAC, dBD, dCE.

1

There are 1 best solutions below

0
On BEST ANSWER

You can approach this as a minimal spanning tree problem. Create a complete graph with the vertices corresponding to your files. Label each edge with the size of the diff file between the two files; as you point out these diffs are the same regardless of direction. You can then use Kruskal's algorithm to find a minimal weight spanning tree.

To actually pick which files you want to keep, start with the smallest file you have, say A. For each of A's edges in the minimal weight spanning tree, pick the diff file which transitions A to the file corresponding to the next vertex. Then from each of these vertices, take each diff file moving outward.

So for your example, your spanning tree would look like:

A -- B -- D
|
C -- E