29 Apr 2003 pfremy   » (Journeyer)

diff is unintuitive

diff is a tool you use without even thinking about it. Given its age, you assume it to be robust and to use the best algorithm. However, reading the documentation about python diff library, I discover that it produces unintuitive set of changes.

Let me show you.

This is a1.c :

if (a) {
	// x
} else {
	// y
}

This is b1.c:

if (a) {
	// z
}

if (a) { // x } else { // y }

How would you intuitively go from a1.c to b1.c ? Obviously, one has added a new statement "if (a) { // z }" in before the existing one.

However, if you pass it through diff, you will get:

--- a1.c 2003-04-29 13:28:15.000000000 +0200 +++ b1.c 2003-04-29 13:28:30.000000000 +0200 @@ -1,5 +1,9 @@

if (a) { + // z +} + +if (a) { // x } else { // y

According to diff, you have modified the inside of the then statement of the test by adding some content, which contains half of the new then statement, an end of statement and a new test.

Both results are correct. But one is more intuitive to the programmer than the other.

The diff library of python would produce the intuitive result. How come ?. The algorithm of the diff python library is to find repeatedly the Longest Common Subsequence (LCS) of both files, until all the common parts have been found. When there is no LCS left, everything remaining are differences. The common code "if (a) { // x } else { // y }" gets identified as common and the rest gets identified properly as difference.

The algorithm of diff is different. Basically, it scans the file as a stream for common lines. When it encounters a difference, it looks forward to find the next common lines. Everything in the middle is a difference. This is why diff has hooked the first "if (a) {" as common to both file and had included in its difference everything until the next common line.

So, don't trust your tools blindly. Although they may be old and respected and robust, someone can still do better. Anyone wants to rewrite to diff so that it produces the most intuitive result ? Given the importance of diffing in producing software, producing more intuitive diff is an issue.

Philippe Fremy

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!