14 Nov 2004 titus   » (Journeyer)

QOTDE: "It is difficult to make predictions, especially about the future"

The write way to right re-usable bioinformatics tools.

It's frustrating how many fantastic bioinformatics analysis tools exist in a difficult-to-use form. Most of the algorithmically challenging tools I use exist only in command-line form; in fact, I can't think of a single sequence analysis program that has an external API. (I understand the situation may be slightly different in the area of clustering software, but that's not my biz at the moment.) A good external interface for NCBI BLAST or CLUSTALW would have saved me many hours.

It's not only the complex programs that suffer from this lack. One of my favorite whipping dogs is EMBOSS, a collection of many rather small command-line programs that do useful bits of analysis. They have tons of stuff, covering most anything you need to do in sequence analysis, but it's all locked away behind formats and stdin/stdout, and much of it is simply easier to re-write if you don't know how to use the program in the first place. In fact, I bet that over 90% of the programs in EMBOSS could be re-written from scratch in little more than a weekend using the scripting language of your choice (abbr "Python").

This is not an entirely idle contention; I rewrote part of fuzztran a few months ago. It took me 30 minutes -- not because I'm a fantastic programmer, but because I had a pattern-searching library that solved a more general problem. Here's what happened:

fuzztran uses a pattern language to search a database of nucleotide sequences after translation. It's useful in situations where you have a leetle bit of protein sequence -- say, from some Edman sequencing -- and want to search a genome or mRNA library for a match. This was exactly the situation I was in, but I needed to search a rather large library containing over 5 million sequences from a whole-genome shotgun sequencing effort on the sea urchin. Moreover, I needed to do an intersection of the results: I wanted to search for two substrings in proximity to each other.

I trundled on over to EMBOSS, read the fuzztran documentation, and tried running it. I immediately ran into several difficulties: it wasn't particularly fast; I didn't know if it was actually working, or if I had entered things in the wrong format; it didn't permit "percent mismatches", as in "find me sequences that match at the 90% level"; it was annoying to script; and the output format wasn't easily parseable.


I spent about 20 minutes trying to find an easy way to use the thing and finally decided that my time was better spent writing a specific tool for my needs. I ended up using my motility toolkit, which supports fuzzy pattern searching with position-weight matrices. I wrote a quick function to reverse-translate amino acids into codons, and thence into a position-weight matrix; once I had this "translate_protein_to_PWM" function written, the final code was very short:

for prot in protein_list:
    matrix = translate_protein_to_PWM(prot)
    length = len(matrix)
    pwm = motility.PWM(matrix)

# allow % mismatches min_score = length - int(float(length) * MISMATCH_PERCENT + 0.5)

print 'searching:', prot for sequence in sequences: if pwm.find(sequence, min_score): # save.

The code, together with testing and debugging, took a total of 30 minutes to write, and worked great -- we found the right protein & went on to verify it experimentally. (The tool is now in my slippy collection under "search-database-for-prot.py".)

Even better, this code was readily extensible to do other things, like mixed protein searches (where you've gotten mixed sequence, e.g. "RYAAGG" and "YGGGAR" were sequenced simultaneously and can't be deconvolved, so you need to search for [RY][YG][AG][AG][GA][RG]") and general domain searches. So that was nice.

OK. Ungapped fuzzy protein sequence searching is, in many senses, a toy problem. There are tons of ways to do it, I'd bet, and none of them would take very long to implement from scratch. The situation is more frustrating when you have to deal with the warts on something like water, which does a Smith-Waterman alignment. This is a moderately tricky piece of code, and reimplementing it isn't a good option for a short-term project. What would be great is if someone broke out the code that did the tricky bits -- the alignment itself -- from the code that worried about parsing input data and constructing output formats. To their credit, the EMBOSS people seem to have done this, but it's in a library that as far as I can tell isn't documented. So it's probably easiest for Joe Blow Bioinformatician to simply use the command-line program, with all of the clumsiness inherent in that approach.

I'd bitch less about the whole problem if it weren't that the EMBOSS folk, and the NCBI folk (who make BLAST), are paid for software development. As mjg59 points out, most analysis programs are written on research grants, where the short-term view outstrips the long-term view. Not so for EMBOSS, who apparently has a whole team of people writing this stuff. I just don't get it; Perl and Python are perfectly good scripting languages, and they're cross-platform; surely it would be easier to just provide a good embedding of the algorithmically challenging functions and then just write the individual programs as scripts??

O well. Some day I hope to rewrite BLAST and retool CLUSTALW to support a nice library API. 'til then, I guess I'll just gripe about the general problem here ;0).


Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!