12 Nov 2004 titus   » (Journeyer)

QOTDE: "The lessons of history teach us -- if the lessons of history teach us anything -- that nobody learns the lessons that history teaches us." (R. Heinlein)

Use Python -- or a language like it. Plus, my savage hatred of "system()"

Hey, look -- a fan! Matthew, dontcha know that the best way to defeat trolls is to ignore them? Or was that giant advertising animatroids? I forget. (<-- gratuitous Simpson's reference.)

Quite apart from my drug problems (acid freak, not crackhead -- there's a difference!) and the gratuitous misreference to GUI programming (I agree completely! I hate GUIs even more than I hate command-line programs -- they're just useful, on occasion!) and the unfortunate failure of my former coauthors -- the swinish bastards! -- to recognize my contributions to the deep foundations of every paper on Avida, I have to agree that any statement recommending, say, Python over Perl, APL, Pascal, or COBOL as a solution is likely to be at best disingenuous and at worst just plain wrong. It is well-known that any Turing-complete language (given infinite memory, yada yada) can emulate any other -- so why choose between them?

Dunno. But, repetitive as it may be to say it, I think a large part of the solution to bad scientific programming is to use a language like Python. Seriously, I'm perfectly aware that Lincoln Stein (and likely Matthew Garrett) can kick my ass when it comes to a mano-y-mano, Perl-y-Python scripting contest. I'm even reasonably confident that Lincoln Stein could take me down in person; he looks mean. (I haven't met Matthew.) But to cite an N of 1 ("worked for me!") as an actual argument... well, I'm no math major but it seems like a large std deviation.

An argument that I might make, were I still slavishly and unreasonably devoted to Perl rather than to Python, would be to point out that anyone writing C extensions for Perl by hand without using SWIG and/or XSAPI probably has bigger problems than over-frequent enjoyment of a little crack. If that's the big problem with Perl, then it's not a problem at all.

This argument ignores the value of writing pseudocode instead of line noise, but that seems to be a personal preference rather than an absolute, for some reason...

And (seriously) Matt's point that this is a social problem is entirely correct. Teaching people Python at an earlier age might help there. ;)

...why "system()" sucks.

But let's move on to a different argument: my savage hatred of "system()". Do an experiment: try writing a parser for the "generic" GFF format. What, you say? That's easy? Sure is -- for each and every one of the bajillion programs that output GFF, it's easy! Now, let's see which field(s) they overloaded this time...

The problem, to put it bluntly, is formats. In information theoretic terms, stdout is often a very lossy channel, and it is difficult (and often impossible) to make it 100% clean. Why? Well, suppose someone gives you some brilliantly written (and novel) standalone piece of code, and it takes in sequences in FASTA format together with a couple of parameters. Now the program does some fantastically complex set of calculations -- gene finding, HMM search, Gibbs sampling, sequence alignment -- and spits out some text as a result. That's right -- some text. What does the text mean? At this point the hapless user of a novel program has several options. S/he can:

  1. write a one-off parser that grabs the necessary data and runs.
  2. write a complete parser that parses all of the output and puts it into a nice structure for later use.
  3. hope like hell that the author of the program provided a "standard" format like GFF that captures some significant component of the output.
  4. wait for someone more anal retentive (or needier, or smarter, or harder-working) to write a really good parser for the format.

Libraries like BioPerl or BioPython give you #3 and #4 (with time). #2 takes a lot of effort and is only worth it when you really need all of the info in the output. #1 is what everybody does, in practice, right up until it bites 'em in the butt.

There's one huuuuuge problem with all of this, however: you're at the mercy of the author of the package to provide full, honest information in the output. Well, good luck with that, and have a good time rewriting your parser when Joe Package Author decides that semicolons are a better divider than commas...

It should be obvious that the best solutions above (#2/#4) can only ever be as good as a good embedding of the package in your SLOC (Scripting Language Of Choice). And, far too often, the actual parsing solution isn't that good, and can't be extended without breaking everybody else's parsers. That's why command-line executables with no associated library or embedding will, to a general and somewhat loose approximation, always suck.

So, people: use Python. Or COBOL. And write library functions loosely wrapped in main()s, not deeply embedded spaghetti code.


The shoutout today goes gnutizen, who obviously has his own drug issues; he certified me as "Journeyer"!

p.s. It turns out I was a math major. Huh. Weird.

p.p.s. If someone with some Perl and C/C++ knowledge were to go comment on my SWIG/Perl embedding of motility (see the CVS) it could be most useful to me. Just a thought.

p.p.p.s. In the bioinformatics language wars, I have to say that Bioconductor really takes the cake in the "absurdity" category. I personally like R, but why someone would choose it over a more mainstream language for general-purpose programming <shakes head>...

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!