diff is unintuitive
diff is a tool you use without even thinking about it. Given its age, you
assume it to be robust and to use the best algorithm. However, reading the
documentation about python diff library, I discover that it produces
unintuitive set of changes.
Let me show you.
This is a1.c :
if (a) {
// x
} else {
// y
}
This is b1.c:
if (a) {
// z
}
if (a) {
// x
} else {
// y
}
How would you intuitively go from a1.c to b1.c ? Obviously, one has added a
new statement "if (a) { // z }" in before the existing one.
However, if you pass it through diff, you will get:
--- a1.c 2003-04-29 13:28:15.000000000 +0200
+++ b1.c 2003-04-29 13:28:30.000000000 +0200
@@ -1,5 +1,9 @@
if (a) {
+ // z
+}
+
+if (a) {
// x
} else {
// y
According to diff, you have modified the inside of the then statement of the
test by adding some content, which contains half of the new then statement, an
end of statement and a new test.
Both results are correct. But one is more intuitive to the programmer than the
other.
The diff library of python would produce the intuitive result. How come ?.
The algorithm of the diff python library is to find repeatedly the Longest
Common Subsequence (LCS) of both files, until all the common parts have been
found. When there is no LCS left, everything remaining are differences. The
common code "if (a) { // x } else { // y }" gets identified as common and the
rest gets identified properly as difference.
The algorithm of diff is different. Basically, it scans the file as a stream
for common lines. When it encounters a difference, it looks forward to find
the next common lines. Everything in the middle is a difference. This is why
diff has hooked the first "if (a) {" as common to both file and had included
in its difference everything until the next common line.
So, don't trust your tools blindly. Although they may be old and respected and
robust, someone can still do better. Anyone wants to rewrite to diff so that
it produces the most intuitive result ? Given the importance of diffing in
producing software, producing more intuitive diff is an issue.
Philippe Fremy