I am looking for some help to understand two GNU utility programs in terms of ordered sets. If you happen to know their usages, you might be able to understand what I am asking here.
In mathematics, a set doesn't impose order between its elements. (if a set does, then it is called an ordered set, a different concept)
the difference operation on two sets results in a set of the elements in the first set but not in the second.
When taking the intersection of two sets, if an element is considered in both sets, it doesn't matter where it appears in each set.
Similar operations to set difference exist on files, such as comm from coreutils and diff from diffutils. But we can't think of a file as a set of lines, but as an ordered set of lines, because the lines are ordered by their line numbers naturally.
Moreover, comm and diff also work differently from each other.
What do comm and diff try to accomplish at concept level (at input and output level) respectively? I suspect I may need some basic knowledge on ordered sets to understand that. I don't expect an explanation at their implementation/algorithmic level, but that might help (I heard that version control is using same algorithms to accomplish incremental copy.)
Specifically, given two files, for each line in each file, how do comm and diff determine
- whether the line also occur in the other file?
- if it does, whether its occurrences in the two files are the same or different?
by taking into account the order between the lines in each file?
How does diff decide whether some line "occurs in both files but differ" or "occurs in one file but not the other"?
How do comm and diff differ when both are used for taking subtraction of two files?
Thanks.
Thanks.
I don't know about
comm, but the math name for whatdifffigures out is "longest common subsequence". If you search using that term you will find tons of information about it; maybe start with the Wikipedia article.Although you do have to think about it kind of backwards, as the output of
diffconsists of all the parts that aren't in the longest common subsequence.Also, as to terminology, "ordered sets" isn't quite right, after all $\mathbb{R}$ is an ordered set. I'd describe them as (finite) sequences.