How shall I understand what the GNU utilities "comm" and "diff" do in terms of ordered sets?

112 Views Asked by At

I am looking for some help to understand two GNU utility programs in terms of ordered sets. If you happen to know their usages, you might be able to understand what I am asking here.

In mathematics, a set doesn't impose order between its elements. (if a set does, then it is called an ordered set, a different concept)

  • the difference operation on two sets results in a set of the elements in the first set but not in the second.

  • When taking the intersection of two sets, if an element is considered in both sets, it doesn't matter where it appears in each set.

Similar operations to set difference exist on files, such as comm from coreutils and diff from diffutils. But we can't think of a file as a set of lines, but as an ordered set of lines, because the lines are ordered by their line numbers naturally.

Moreover, comm and diff also work differently from each other.

What do comm and diff try to accomplish at concept level (at input and output level) respectively? I suspect I may need some basic knowledge on ordered sets to understand that. I don't expect an explanation at their implementation/algorithmic level, but that might help (I heard that version control is using same algorithms to accomplish incremental copy.)

Specifically, given two files, for each line in each file, how do comm and diff determine

  • whether the line also occur in the other file?
  • if it does, whether its occurrences in the two files are the same or different?

by taking into account the order between the lines in each file?

How does diff decide whether some line "occurs in both files but differ" or "occurs in one file but not the other"?

How do comm and diff differ when both are used for taking subtraction of two files?

Thanks.

Thanks.

2

There are 2 best solutions below

2
On

I don't know about comm, but the math name for what diff figures out is "longest common subsequence". If you search using that term you will find tons of information about it; maybe start with the Wikipedia article.

Although you do have to think about it kind of backwards, as the output of diff consists of all the parts that aren't in the longest common subsequence.

Also, as to terminology, "ordered sets" isn't quite right, after all $\mathbb{R}$ is an ordered set. I'd describe them as (finite) sequences.

9
On

$\mathtt{comm}$ does a set operation on two sets of text lines on the assumption that the two sets are represented as sorted input files, so it can work through both sequences deciding which lines are in file 1 only, which are in file 2 only and which are in both without having to backtrack.

$\mathtt{diff}$ is quite different. It takes two text files and tries to find an efficient way of describing the second file as a result of editing the first. This is a loosely defined problem. $\mathtt{diff}$ uses some form of the longest common subsequence algorithm to solve it. The results can be counter-intuitive.