Older blog entries for titus (starting at number 14)

Gravity's just a theory, too.

Creationists and those who firmly believe climate change isn't driven by humans miss the point: science isn't about providing certainty. It's about providing uncertainty.

Take gravity. Gravity is something that we can observe pretty easily just by dropping an apple. We can note correlations (massier planets seem to have larger gravitational fields, for example). We can guess that, since the flux per unit area through the surface of a sphere decreases as the inverse square of the sphere's radius, gravity is subject to the inverse square law. We can even posit underlying mechanisms linking gravity to a specific particle, like the Higgs boson. What we can't do is prove that we understand how gravity works, except in terms of other theories (like particle theory and general relativity). We also can't guarantee that gravity functions the same way (or at all!) in places out of our direct experimental reach -- we can just show that the cosmological motions we see match our expectations were gravity to work the same.

These are the same objections that people bring to evolution and climatology: we don't understand much about the underlying mechanisms in either area. We can't show that the same rules that we see operating today are the rules that operated 2,000 or 4,000 or 500,000 years ago. We can say that what we see in the fossil record and among living organisms today strongly suggests a single common ancestor for all life on earth; but we can't rule out the theory that God created the earth 6,000 years ago, because we don't have any objective observers from that time. We certainly can't demonstrate that human activity has caused climate warming, although there do seem to be significant correlations between human activity and climate change. (Note that correlation does not imply causation, though.)

So, why is gravity undisputed (except by Flat Earth people)? And why are climate change and evolution such hot topics? I'm not sure, but I can suggest a few reasons.

Gravity is undisputed today partly because no religion has made the precise mechanism a point of recent dispute. It used to be in dispute, though; remember Galileo? That, ultimately, was a dispute about gravity on the scale of our solar system. Yet no edicts about the Higgs boson, or general relativity, have emanated from the Catholic Church, and Bush doesn't seem to care about gravity.

Another reason that people don't argue much about gravity is that the theory of gravity is predictive. Given a comet's position and momentum, we can tell you pretty much where it's going to go. It's a little harder in atmosphere, but we do it very well -- think ballistic missiles, for example. This predictive power goes a long way towards quieting dissent with the theory, because if you can predict something people will generally believe you understand it pretty well. (We'll come back to this.)

Evolution, for better or for worse, is not in the same position. It's a major point of dispute in at least a few places, and it's not predictive in the least. Even worse, it can't be very specific in predictions, because it's a stochastic theory that is subject to historical contingency. We will never be able to predict what mutations will arise randomly, and we will probably never be able to predict what effect those mutations will have on ecosystems. We might be able to predict general trends, but that is still far away from being an exact science.

Climatology is a much younger science than either the physics of gravity or the study of evolution. Like evolution, and unlike gravity, it seems to be very sensitive to certain kinds of perturbations -- that is, it's "chaotic". Very small changes may have large effects elsewhere. Moreover we don't understand many of the basic processes very well, and we don't have good ways to measure even relatively simple things like energy input from the sun, much less complicated things like CO2 consumption. Climatology is certainly not a predictive science in general, although some things can be predicted, just like in evolution: if you know where a hurricane is today, you can guess pretty well where it's going to be tomorrow.

Climatology is also a big point of contention for economic reasons: global warming, in particular. Corporations don't want to reduce the emissions of greenhouse gasses because they believe that it will have a negative economic impact on them. Therefore they (or their proxies) attack global warming as an unproven theory, in order to undermine its impact on public policy. As with the religiously motivated attacks on evolution, this is definitely bad for science.

If we could predict climate, or predict the effects of evolution, presumably people would regard these theories as being more credible than they are now. Unfortunately it's impossible to turn evolution into a predictive theory, and it's going to be a while before we get a predictive handle on climatology. So both theories are amenable to attack on the charges of being "unproven".

And here we come to the nut: the scientific method can't prove anything, in general. It is much, much better at disproving theories than it is at confirming them; any working scientist will agree with that! All that an honest scientist can say about gravity, or evolution, or global warming, is that they haven't been disproven yet. There are reasons to believe that gravity and evolution are pretty good theories, scientifically speaking, because they've withstood the test of time. I'm not very knowledgeable about climatology but I do know it's quite a bit shakier in its underpinnings. But attacking any of these theories for not having provided proof is missing the whole point of science, which is to disprove as much as possible.

People -- even many intelligent people who should know better -- frequently get this wrong. Michael Crichton, the prolific author of (among other books) Jurassic Park, gave an interesting lecture at Caltech where he talked about scientists' involvement in political debates on public policy. Nuclear winter and global warming were two examples where a strongly biased view has been pushed hard and publicly by a relatively small cadre of scientists. Crichton's view seemed to be that scientists were no less fallible than anyone else, which is undeniable (though unpopular among scientists ;). What he missed, and what I think many scientists fail to emphasize, is that thus far the scientific method -- with objective measurements and peer review, in particular -- is the only proven method of discovery known to mankind. We ignore it at our peril.

Scientists can do their part by proudly admitting ignorance. It's not pleasant, but it's undeniable: did you know, for example, that the underlying mechanism by which evolutionary novelty arises is still in dispute? Yep! We still don't really understand how new traits arise! And did you know that the precise reflectivity of the earth -- which is a major determinant of energy input into our climate, and is directly linked to the "greenhouse effect" -- is still not easily measurable? Yep! No long-term trends available! And these are just two things I've worked on -- I'm sure there's an ocean of ignorance out there, just waiting to be publicized. That's science!

The flip side of the coin is that those who critically examine scientific theories should apply the same level of critical analysis to their own beliefs. This applies to postmodern lit-crit as much as it applies to religious believers -- and I think it's as important as science is, as a method for making public policy.

Note to readers: I've been thinking about writing something like this for a while. It's an ongoing project, so please e-mail me at titus@caltech.edu if you have thoughts, criticisms, or suggestions.

The only problem with troubleshooting is that trouble sometimes shoots back. -- Joe Zeff.

I've been noticing a fair amount of commentary on Python and Java lately: I particularly enjoyed Bruce Eckel's take on Static vs Dynamic typing, and Phillip Eby's Python Is Not Java (and Java Is Not Python, either). Phillip Eby makes the point that the Python and Java mindsets are quite different when it comes to frameworks: Python programmers tend to develop the structure out as they need it, while Java designers try to specify the frameworks' structure first & then fill in with specific implementations. Isn't this antithetical to the agile programming paradigm that's been gaining popularity lately?

Jython does a nice job of mingling Java libraries with Python coding; I think many of the Python-native extension modules can be loaded directly by Jython, too. Is this a possible solution to the question of static vs dynamic typing -- build your software in a language like Jython, and then slowly solidify it into Java?

I primarily do research programming, in which the specific goals of the software are largely undefined & the flexibility of the code should be one of the proximal design considerations, so I definitely prefer the Python(/Perl/Ruby) mindset in day-to-day work. There is a question in my mind, though, about where future bioinformatics software efforts will aim: I doubt that the current loosely-coupled/badly-specified project-specific protocols for genome databases and service frameworks will last, so where next? We could either start developing specifications (e.g. the distributed annotation system (DAS) or MAGE) or implementations (e.g. GMOD). If the former, there will be a significant barrier to entry for new projects, as they will need to spend time developing to the standard and confirming adherence. (This is the primary reason why DAS is a failure, I think.) If the latter, I predict a general tendency towards complexity of internal design as different projects try to cram all their needs into a single system. Either situation would be bad.

My preference is for what I think is a middle ground: the development of APIs around common tasks, in a variety of languages. The idea would be to take protocols like DAS and provide fairly simple library implementations that give you 90% of the needed functionality with 10% of the code complexity (based on the well known 90%/10% rule ;). The key is to make sure the implementations work well enough to do something useful & are in enough languages that e.g. the lone maverick Python/OCaml/Ruby programmer in the sea of Perl & Java programmers wants to play as well (just as one example!).

At the moment there are few tasks generic enough to be encapsulated by such an approach: the two that I can think of are annotation & microarray data presentation. Annotation suffers from a general lack of interoperability: not only does everyone have their own standards, but features don't transfer well between standards. I hear microarray data is the same, although I don't work with it much. It'd be interesting to try to work around the ontology problems (do you *really* want to define an ontology before getting your work done!?) to produce a genuinely useful annotation UI that interoperates. I don't see one out there that's usable by "mere" biologists, and I think that's the right target audience...

Why not use, say, XML? Well, properly grokking XML is burdensome and the whole process is pretty legalistic (lots of people yakking etc.). Since the goal is to lower the barrier to entry I think it's important to have some functioning libraries as soon as possible -- that way people can get the thrill of having the code actually work. When the library moves towards a standard, projects that are already functioning will at least have some reason to move with that library...

Hats off to the Chinook folks, who are developing a P2P bioinformatics system; you can access the code via CVS, finally.


Bugs bugs bugs bugs bugs...

Apparently this week is "let's find bugs in Titus's software" week. Didn't know it was formally defined... but three different people have poked holes in three different-but-related projects. The holes range from already-fixed-but-not-in-the-build (FRII) through important-but-easy-to-fix (Cartwheel) to important-and-bloody-difficult-to-fix (paircomp). I have to say my users are really great: finding two of these bugs required great attention to detail. Thanks, guys!

The trickiest bug to fix involves finding transitive connections between three two-way comparisons (find all paths A-->B-->C such that A-->B, B-->C, and A-->C are all matches). I came up with a clever solution that was easy to understand and easy to implement in simple code; unfortunately, it falls apart in the face of reverse complementing. (As you may know, DNA is readable in two directions: AATTGGCC is equivalent to its reverse complement, GGCCAATT (complement: A <--> T, G <--> C).) This problem is compounded by the asinine data structure that I use to represent the matches. Looks like it's time for a serious refactoring...
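To make the problem concrete, here is a minimal sketch of the forward-strand-only case, plus a reverse-complement helper. It is not the paircomp code: the function names and the representation of matches as sets of (position, position) pairs are assumptions made for illustration.

```python
# Complement table for the four DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def transitive_paths(ab, bc, ac):
    """Find (a, b, c) triples supported by all three pairwise comparisons.

    Each argument is a set of (position, position) pairs; this
    representation is an assumption made for the sketch.
    """
    # Index the B-->C matches by their position in B for quick lookup.
    by_b = {}
    for b, c in bc:
        by_b.setdefault(b, []).append(c)

    paths = []
    for a, b in sorted(ab):
        for c in sorted(by_b.get(b, [])):
            # Keep the path only if the direct A-->C match agrees.
            if (a, c) in ac:
                paths.append((a, b, c))
    return paths
```

This version ignores orientation entirely. Once a match can be on the reverse strand, each pair would also need to carry a strand flag, and the A-->C check would have to compose the orientations of A-->B and B-->C, which is where the simple approach stops being simple.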

All of these bugs remind me of this great quote from an interview with Damian Conway:

"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." -- Brian Kernighan, via Damian Conway

I really enjoyed reading this Damian Conway interview on builderau.com. This is a man who has done it all, and has sound advice based on experience. He also gives an excellent reason for using Perl: it's an immensely powerful language that lets you do pretty much whatever you want. (I don't think it's a good idea for inexperienced programmers to use Perl for anything more than short scripts, but -- like Python -- I suspect "short scripts" describes 95% of what is done with Perl ;).

In other news, my OCaml adventures proceed apace. I just finished my very first OCaml program (temp link). dd2.ml implements a simple recursive global-alignment algorithm that finds the optimum gapped alignment between two sequences. Dog slow, but functional (ha ha...)! Now to see if I can add some heuristics into the algorithm to make it speedier.
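For readers who haven't seen one, a recursive global alignment looks roughly like the sketch below, written in Python rather than OCaml. It is not a port of dd2.ml: the scoring values are assumed, and the memoisation via lru_cache is an addition that keeps the recursion from being exponential.

```python
from functools import lru_cache

MATCH, MISMATCH, GAP = 1, -1, -1   # assumed scores, not taken from dd2.ml


def global_align(a, b):
    """Return (score, aligned_a, aligned_b) for the best global alignment."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # Base cases: one sequence is exhausted, so pad the other with gaps.
        if i == len(a):
            return GAP * (len(b) - j), "-" * (len(b) - j), b[j:]
        if j == len(b):
            return GAP * (len(a) - i), a[i:], "-" * (len(a) - i)

        # Option 1: align a[i] with b[j].
        score, x, y = best(i + 1, j + 1)
        pair = MATCH if a[i] == b[j] else MISMATCH
        options = [(score + pair, a[i] + x, b[j] + y)]
        # Option 2: consume a[i] against a gap in b.
        score, x, y = best(i + 1, j)
        options.append((score + GAP, a[i] + x, "-" + y))
        # Option 3: consume b[j] against a gap in a.
        score, x, y = best(i, j + 1)
        options.append((score + GAP, "-" + x, b[j] + y))
        return max(options, key=lambda option: option[0])

    return best(0, 0)
```

Calling global_align("ACGT", "AGT") returns the score and the two gapped strings. Because it recurses one level per residue, this sketch is only suitable for short sequences.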

OCaml is a lot of fun, I must say. At some point I look forward to making use of OCaml's ability to ship cross-platform bytecode around to different machines. It'd be great to be able to add new alignment views and other analyses directly into FamilyRelationsII simply by downloading some new OCaml code! I've also been thinking about how to use OCaml in my tuple space/map-reduce implementation... seems like a good fit!

Last but not least: WSGI. There is now a Web site containing my Quixote and SCGI adapters for the Python WSGI standard. It also turns out I owe Ian Bicking an apology: when I asked why Webware didn't have an adapter, I'd missed Ian's WSGIKit implementation (SVN here, blog here). It's not an adapter so much as a reimplementation effort, as far as I can tell, so I still think there's room for a simple adapter that Just Works (tm). If experiments continue sucking maybe I'll work on that...

ta for now,

30 Nov 2004 (updated 30 Nov 2004 at 20:46 UTC) »
Stevey -- check out http://www.blogtorrent.com/, it might be what you're looking for. [UPDATE: no, it's not. Never mind.]

In other news, I just updated my QWIP/SWAP README with some simple usage examples, after trying them out with WSGI Utils. (They worked! (sort of)) Stupidly enough I previously posted a dated direct link to the qwip-swap .tar.gz, so I'm waiting 'til I can construct a Real Web Site for QWIP/SWAP to post the slightly updated distro.



'Vegetarian' -- it's an old Indian word meaning 'lousy hunter'.
              -- Red Green
30 Nov 2004 (updated 30 Nov 2004 at 08:04 UTC) »

''' There is a joke about American engineers and French engineers. The American team brings a prototype to the French team. The French team's response is: "Well, it works fine in practice; but how will it hold up in theory?" ''' -- unknown, via Mike Vanier.

OCaml, Python/WSGI, and scalable programming:

Spent some time over the last few days "learning" OCaml, by which I mean reading first the C++/Java programmer's intro to OCaml and then an OCaml tutorial. This is all part of an effort to broaden my horizons: I enjoy using Python and C to solve problems on a daily basis, but I've never learned a functional programming language. Man, is it frustrating to pick up a new language -- I feel completely helpless to write even the simplest program. This is compounded by my complete inability to think recursively...

I'm looking into OCaml because several different computer-geek friends suggested I try it out. Since all of them profess a love of Python, yet are wiser and more experienced than I in the ways of programming languages (I guess a CS background is useful for something...) I decided to buckle down and study OCaml a bit. So far I've gained an appreciation for the cleverness of OCaml and OCaml programmers, marvelled at 'match', and realized how cool currying is. Not bad for two days ;).

In other news, David Warnock pointed out in his blog that my simple Thanksgiving Day WSGI wrapper for SCGI might be the best-performing WSGI server around, because it's built on top of mod_scgi/SCGI. mod_scgi/SCGI is already fully functional and used for "real" Web sites that run Quixote, and my leetle SWAP code effectively turns this into a full-blown WSGI server. Cool. It seemed too easy to implement, though, so I must be missing some aspect of the WSGI master plan -- why hasn't Webware done this yet, for example?

In connection with that, I've been thinking that an interesting project would be to implement an SCGI server in OCaml. I don't see anything like it out there on the projects page, and it wouldn't take that long to do...

Last but not least, as part of my OCaml adventure, I came across Mike Vanier's rant on the scalability of languages. In it he says, or implies, many things that I wish I could have said more clearly. Things like "The right way to use languages like C is to implement small, focused low-level components of applications written primarily in higher-level languages". Yeah, that.

Mike is one of the three people that suggested I learn OCaml, so I'm a bit saddened by his epilogue in which he turns a little bit away from OCaml (for good reasons, it sounds like, but nonetheless...)


QOTDE: Things Will Change -- Iain M. Banks, Against a Dark Background (the quote on Gorko's Tomb)

WSGI, Quixote, SCGI, QWIP, and SWAP

In a fit of depression over lousy experimental results, with a healthy serving of turkey on top, I decided to turn my hand to something I do better than experimental molecular biology: program in Python. (Trust me, whatever you think of my programming... my molecular biology is weaker. sigh.)

Pursuant to the general public prodding of various people on the Quixote list, I spent a few hours on the couch today and built two interfaces for WSGI, QWIP and SWAP. (README and source download.)

QWIP, the "Quixote-WSGI interface p(something)", wraps the Quixote publisher in a WSGI-compliant application object. This lets any WSGI-compliant servers out there (are there any?) publish Quixote objects.
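The general shape of such a wrapper is small. The sketch below is not QWIP itself and does not use Quixote's API: HypotheticalPublisher and its publish method are stand-ins invented for illustration. It is written against the Python 3 flavour of WSGI (PEP 3333), where response bodies are byte strings.

```python
class HypotheticalPublisher:
    """Stand-in for a framework's publisher; not Quixote's real API."""

    def publish(self, path):
        return "You asked for %s\n" % path


class PublisherApp:
    """Wrap a publisher-like object in a WSGI application callable."""

    def __init__(self, publisher):
        self.publisher = publisher

    def __call__(self, environ, start_response):
        # WSGI passes request data in `environ`, a CGI-style dictionary.
        path = environ.get("PATH_INFO", "/")
        body = self.publisher.publish(path).encode("utf-8")
        headers = [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ]
        start_response("200 OK", headers)
        # The return value is an iterable of byte strings.
        return [body]
```

Any WSGI server can then host PublisherApp(HypotheticalPublisher()); the standard library's wsgiref.simple_server.make_server is one way to try it locally.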

SWAP, the "SCGI-WSGI application p(something)", allows the SCGI standalone server interface ('scgi server') to run WSGI-compliant applications. For example, this lets mod_scgi run WSGI applications via the SCGI server -- including QWIP-wrapped applications, which was my testing strategy ;).
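Most of what an SCGI-to-WSGI bridge has to do on the input side is turn the SCGI request into a dictionary. An SCGI request is a netstring of NUL-separated header names and values, followed by the body. The parser below is a sketch of that one step, not the SWAP code; the function name is made up, and it assumes the whole request is already in memory.

```python
def parse_scgi_request(data):
    """Split a raw SCGI request (bytes) into (headers dict, body bytes)."""
    # The header block is a netstring: b"<length>:<payload>,"
    length_text, _, rest = data.partition(b":")
    length = int(length_text)
    payload = rest[:length]
    if rest[length:length + 1] != b",":
        raise ValueError("malformed netstring")

    # The payload is name NUL value NUL name NUL value NUL ...
    fields = payload.split(b"\x00")[:-1]   # drop the empty tail after the last NUL
    headers = {
        name.decode("latin-1"): value.decode("latin-1")
        for name, value in zip(fields[0::2], fields[1::2])
    }

    # CONTENT_LENGTH is required by the protocol and gives the body size.
    content_length = int(headers["CONTENT_LENGTH"])
    start = length + 1
    body = rest[start:start + content_length]
    return headers, body
```

The resulting headers are already CGI-style names (REQUEST_METHOD, REQUEST_URI and so on), so they map fairly directly onto a WSGI environ once the wsgi.* keys and an input stream for the body are added.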

Overall, my modicum of experience with the internals of Web servers (mostly from PyWX and some minor hacking on Quixote) served me well; it took me about 1 hr to get QWIP working, and about 3 hours to get SWAP working. (Over half of those three hours was spent figuring out that (1) I was instantiating a new object rather than calling the superconstructor, because I'd left out __init__; and (2) that SCGI expected the input and output streams to be closed to signal that the connection was over. Sigh.) It was pretty satisfying to sit back and set up this set of modules:

Apache <--> mod_scgi <--> SCGI server <--> SWAP <--> QWIP <--> Quixote demo
and have it all work!
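The first of those two bugs is easy to reproduce. The sketch below is one reading of it, using hypothetical class names rather than the real SCGI or SWAP classes: leaving out __init__ turns a call to the base-class initialiser into the construction of a brand-new object.

```python
class Handler:
    def __init__(self, name):
        self.name = name


class BrokenHandler(Handler):
    def __init__(self, name):
        # Bug: this builds a separate, throwaway Handler and never
        # initialises `self`.
        Handler(name)


class FixedHandler(Handler):
    def __init__(self, name):
        # Calling __init__ on the base class initialises `self`.
        Handler.__init__(self, name)
```

BrokenHandler("x") constructs without complaint but has no name attribute, so the failure only shows up later, when something tries to use it.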

I'm now moderately more optimistic about the usefulness of WSGI. I hate (no, loathe) frameworks that attempt to solve the problems of mankind, if you'll just drink this Kool-Aid sir... But, notwithstanding the philosophical debut in the WSGI PEP, it was pleasant to implement the adapters and I could see WSGI being of significant benefit to Web server authors. Or maybe by buying into the framework I've sold out and you can't trust my opinion ;).

So, kudos to Phillip Eby & I hope this stuff is useful to someone! Now, back to making my Quixote applications do more stuff!


p.s. Has anyone else noticed that advogato.com and www.advogato.com read cookies differently? Kind of amusing to go to one or the other and have different options available, one as logged-in member & the other as nobody...

QOTDE: "One of the symptoms of an approaching nervous breakdown is the belief that one's work is terribly important." -- Bertrand Russell, via Timothy Foreman.

Academic publishing may not quite be ready for Open Source just yet...

When last we met our fearless hero, I'd talked about submitting an article on my software to BioTechniques. We got word back on Tuesday: editorial rejection prior to review. The reason? Lack of originality, because, to quote:

"As noted by the authors, the programs described in this manuscript are available online and already in wide use."

Silly me: I thought that having shown that the programs worked for a wide variety of biological systems was a good thing!

It seems that the logic-challenged people at BioTechniques only want unproven software published. If I convolute my own logic processor, I can understand this, sort of: why would anyone read an article about software that they're already using? Of course, the assumption going into that is that the sole purpose of publishing is to introduce people to completely novel results, not just something that most people won't have seen. It's certainly not like FRII is so widely used that Joe Developmental Biologist will have already seen it.

O well. On the advice of a friend more seasoned than I, I am re-submitting to BMC Bioinformatics, where I am told that functioning software is welcomed.

I do have to say that this little interaction has not raised my opinion of BioTechniques. The editors didn't bother sending it out for peer review, they simply slotted it into their narrow preconceptions of How Software Is Done and cut off the bits that didn't fit. At least I don't have to be upset with my peers; I can just call the BT editors "clueless" and move on!

There appear to be very few places to actually publish software. This is surprising, given how much biology is starting to depend on it in this new era of too much sequence. The standard technique is to do some moderately interesting bit of science using the software & then drop it into a moderately good journal like Genome Research. That's great -- if the software you're writing has some immediate scientific value that can be ascertained without experiments. If you need to do experiments, you're talking about a 6-12 mo wait before you can finish the experiments & then publish the software. Not exactly timely.

It's more troublesome that you can't expose your software to the Real World and publish it as novel once other people know about it. Next time I write a standalone piece of software I'll have to remember not to tell anyone else before publishing it...

Thursday morning miscellany

Johnny Bartlett, a fine member of this august site, asked me to pimp his book, although he acknowledges it's not a must-read for software architecture. The book is "Programming from the Ground Up"; not having read it, I am willing to pimp it not only by request but because Joel Spolsky recommends it. Go buy it.

A synaesthetic friend sent me this fascinating article on tetrachromatic women. I think it's a very interesting philosophical exercise to contemplate what such people see & realize that we will never know. The article makes a big point of the adaptability of the human brain; I'm not that surprised, because it seems like the brain adapts to place people in their own political realities easily enough... ;) I guess that physical adaptation on the level of new nerve pathways is moderately surprising, although not new: see this article, & search for "inverted".

Normally I hate reposting links without commentary, 'cause meme tracking has shown that everyone does it, so why should I waste my time? But sometimes you run across something so hilarious that you've just gotta share: sometimes you need a bigger tow truck than you originally thought. There was also a great story about crashing doorbells, but I won't repost that.

Support our troops (if you're from the US)

Last but not least, check out AnySoldier.com. Whether or not you believe in the war (I do think getting rid of Saddam was a good idea) or support our leader (are you kidding?), we should remember that the troops who are over there are generally good people who are in an uncertain combat environment fighting for their lives. It behooves us to support them, whatever you think of the people who sent them there.

A friend who is also a military nut said this about what units to contribute to:

On one hand, I'd suggest reserve/nat'l guard units - they being in a more protracted and stressful situation than they expected. On the other hand, reservists are more likely to have a better support structure from their families (more likely married). Active duty Army GIs and Marines are more likely to be single, 18-21 year olds. Obviously both have families, but, you know, direct support from a spouse and children in addition to the rest of the family is what I'm getting at.

So, send books & chocolate, and support 'em.


This diary entry dedicated to my synaesthetic friend Tamara.

QOTDE: "The only mistake you can make is to believe you cannot make mistakes." (via Carlos Gershenson)

Web software non-release & good books on software development

I'm waiting eagerly for our server admin to update Cartwheel to my latest version. We have two "working branches" on the SourceForge CVS repository, one for the Beowulf cluster configuration and one for the Web server configuration; to update, I simply merge the development branch into each of these branches and tell Ian (our server admin) to run 'cvs update' and restart. This time is a bit more complicated, because Python, psycopg, and Quixote all have new versions; I've added some new analysis programs (LAGAN and blastz); and I'm now using BioPython to parse NCBI BLAST output. BioPython, in particular, is a pain -- it's a big heap o' code, and it doesn't interact well with Quixote. O well, c'est la vie.

This update is pretty substantive: I added a bunch of new functionality to round out what was already there, then wrote it up in an article we're submitting to BioTechniques. (Let me know if you want a pre-acceptance copy.) I've been told that BioTechniques isn't the highest quality or highest impact journal, but I get the impression that it reaches a fairly wide audience of biologists. And that's my goal: to reach the users, not to publish a scientific article (got some of those on the way!). This paper is paper #2 of 4 dissertation papers, too, and it's nice to get it off my back. It's also the first paper where I'm corresponding author, which is pretty cool; for the non-academics out there, that signals that it's my project, not my advisor's.

I don't know when, if ever, I'll get around to an actual "release" of Cartwheel. There's no point as long as the one server that we run keeps up with demand; I don't think it's near to conking out, but I could be wrong. I've never stress-tested it, because it's not that kind of Web site... Maybe someday other people will start installing it and then I'll want to canonicalize the installation a bit more ;).

The kind of release we're doing now -- "here's the functionality, go play" -- is certainly the right thing for its current users, who are mostly GUI-using biologists. Anyone who wants to take a deeper look can do so via SourceForge; there's some moderately useful Web services APIs in there, for example.

...good books on software development

Rather than being critical of yahoo academic software development, I thought I'd be friendly today. Here's my list of good background books for software design. It's a very short list: Lakos's C++ book, Design Patterns, and Patterns of Software (+ some other links at the bottom). Fowler's Refactoring definitely belongs on there as well.

I regard these as "must-reads" if you're going to seriously think about writing even a moderately sized software project; if you read them and think "that was a waste of time..." you're either very experienced or you should do us a favor and not write any more software. In my not-so-humble opinion.

(Hmm, that wasn't very friendly. I've got to work on those anger issues, it seems.)

Please send me a private e-mail if you have additional suggestions; I'm always interested in good new books.


Today's diary entry dedicated to salmoni, who could use a little love... Stick with it, you'll find something!

QOTDE: "It is difficult to make predictions, especially about the future"

The write way to right re-usable bioinformatics tools.

It's frustrating how many fantastic bioinformatics analysis tools exist in a difficult-to-use form. Most of the algorithmically challenging tools I use exist only in command-line form; in fact, I can't think of a single sequence analysis program that has an external API. (I understand the situation may be slightly different in the area of clustering software, but that's not my biz at the moment.) A good external interface for NCBI BLAST or CLUSTALW would have saved me many hours.

It's not only the complex programs that suffer from this lack. One of my favorite whipping dogs is EMBOSS, a collection of many rather small command-line programs that do useful bits of analysis. They have tons of stuff, covering most anything you need to do in sequence analysis, but it's all locked away behind formats and stdin/stdout, and much of it is simply easier to re-write if you don't know how to use the program in the first place. In fact, I bet that over 90% of the programs in EMBOSS could be re-written from scratch in little more than a weekend using the scripting language of your choice (abbr "Python").

This is not an entirely idle contention; I rewrote part of fuzztran a few months ago. It took me 30 minutes -- not because I'm a fantastic programmer, but because I had a pattern-searching library that solved a more general problem. Here's what happened:

fuzztran uses a pattern language to search a database of nucleotide sequences after translation. It's useful in situations where you have a leetle bit of protein sequence -- say, from some Edman sequencing -- and want to search a genome or mRNA library for a match. This was exactly the situation I was in, but I needed to search a rather large library containing over 5 million sequences from a whole-genome shotgun sequencing effort on the sea urchin. Moreover, I needed to do an intersection of the results: I wanted to search for two substrings in proximity to each other.

I trundled on over to EMBOSS, read the fuzztran documentation, and tried running it. I immediately ran into several difficulties: it wasn't particularly fast; I didn't know if it was actually working, or if I had entered things in the wrong format; it didn't permit "percent mismatches", as in "find me sequences that match at the 90% level"; it was annoying to script; and the output format wasn't easily parseable.


I spent about 20 minutes trying to find an easy way to use the thing and finally decided that my time was better spent writing a specific tool for my needs. I ended up using my motility toolkit, which supports fuzzy pattern searching with position-weight matrices. I wrote a quick function to reverse-translate amino acids into codons, and thence into a position-weight matrix; once I had this "translate_protein_to_PWM" function written, the final code was very short:

for prot in protein_list:
    matrix = translate_protein_to_PWM(prot)
    length = len(matrix)
    pwm = motility.PWM(matrix)

    # allow % mismatches
    min_score = length - int(float(length) * MISMATCH_PERCENT + 0.5)

    print 'searching:', prot
    for sequence in sequences:
        if pwm.find(sequence, min_score):
            pass  # save.

The code, together with testing and debugging, took a total of 30 minutes to write, and worked great -- we found the right protein & went on to verify it experimentally. (The tool is now in my slippy collection under "search-database-for-prot.py".)
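For anyone wondering what a "translate_protein_to_PWM" helper might look like, here is a minimal sketch. It is not the code from search-database-for-prot.py: the A/C/G/T column order, the 0/1 scoring, and the best_score stand-in for the motility search are all assumptions made for illustration, and no motility calls are used. It also treats the three codon positions as independent, which over-accepts for amino acids like Leu, Ser and Arg whose codons don't share a prefix.

```python
import itertools

# Standard genetic code, written in the usual compact TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = {}  # amino acid -> list of codons that encode it
for aa, codon in zip(AMINO, itertools.product(BASES, repeat=3)):
    CODONS.setdefault(aa, []).append("".join(codon))

COLUMNS = "ACGT"  # assumed column order of each matrix row


def translate_protein_to_PWM(protein):
    """Return one [A, C, G, T] row of 0/1 scores per nucleotide position."""
    matrix = []
    for aa in protein:
        for pos in range(3):
            allowed = set(codon[pos] for codon in CODONS[aa])
            matrix.append([1 if base in allowed else 0 for base in COLUMNS])
    return matrix


def best_score(matrix, dna):
    """Hypothetical stand-in for a PWM search: slide the matrix along one
    strand of dna (ungapped) and return the best total score."""
    width = len(matrix)
    best = 0
    for start in range(len(dna) - width + 1):
        score = 0
        for row, base in zip(matrix, dna[start:start + width]):
            column = COLUMNS.find(base)
            if column >= 0:  # ambiguous bases such as N simply score 0
                score += row[column]
        best = max(best, score)
    return best
```

With this scoring a perfect match scores one point per nucleotide, which is why the cutoff can be written as the length minus the allowed mismatches. For a 6-residue peptide the matrix has 18 rows; assuming MISMATCH_PERCENT is 0.1 (the "90% level"), min_score = 18 - int(1.8 + 0.5) = 16.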

Even better, this code was readily extensible to do other things, like mixed protein searches (where you've gotten mixed sequence, e.g. "RYAAGG" and "YGGGAR" were sequenced simultaneously and can't be deconvolved, so you need to search for "[RY][YG][AG][AG][GA][RG]") and general domain searches. So that was nice.

OK. Ungapped fuzzy protein sequence searching is, in many senses, a toy problem. There are tons of ways to do it, I'd bet, and none of them would take very long to implement from scratch. The situation is more frustrating when you have to deal with the warts on something like water, which does a Smith-Waterman alignment. This is a moderately tricky piece of code, and reimplementing it isn't a good option for a short-term project. What would be great is if someone broke out the code that did the tricky bits -- the alignment itself -- from the code that worried about parsing input data and constructing output formats. To their credit, the EMBOSS people seem to have done this, but it's in a library that as far as I can tell isn't documented. So it's probably easiest for Joe Blow Bioinformatician to simply use the command-line program, with all of the clumsiness inherent in that approach.
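To make "breaking out the tricky bits" concrete, here is a deliberately simplified sketch of what a callable alignment core could look like. It is not EMBOSS's code or API: it only returns the best local-alignment score, uses a flat match/mismatch score and a linear gap penalty (real tools typically use substitution matrices and affine gaps), and every name and default below is an assumption.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score of a and b.

    Keeps only two rows of the dynamic-programming table, so it reports
    the score but not the alignment itself.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best
```

The point isn't the algorithm; it's that the caller gets a number back from a function call instead of having to fish it out of a report.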

I'd bitch less about the whole problem if it weren't that the EMBOSS folk, and the NCBI folk (who make BLAST), are paid for software development. As mjg59 points out, most analysis programs are written on research grants, where the short-term view outstrips the long-term view. Not so for EMBOSS, which apparently has a whole team of people writing this stuff. I just don't get it; Perl and Python are perfectly good scripting languages, and they're cross-platform; surely it would be easier to just provide a good embedding of the algorithmically challenging functions and then just write the individual programs as scripts??

O well. Some day I hope to rewrite BLAST and retool CLUSTALW to support a nice library API. 'til then, I guess I'll just gripe about the general problem here ;0).


12 Nov 2004 (updated 12 Nov 2004 at 16:19 UTC) »

QOTDE: "The lessons of history teach us -- if the lessons of history teach us anything -- that nobody learns the lessons that history teaches us." (R. Heinlein)

Use Python -- or a language like it. Plus, my savage hatred of "system()"

Hey, look -- a fan! Matthew, dontcha know that the best way to defeat trolls is to ignore them? Or was that giant advertising animatroids? I forget. (<-- gratuitous Simpsons reference.)

Quite apart from my drug problems (acid freak, not crackhead -- there's a difference!) and the gratuitous misreference to GUI programming (I agree completely! I hate GUIs even more than I hate command-line programs -- they're just useful, on occasion!) and the unfortunate failure of my former coauthors -- the swinish bastards! -- to recognize my contributions to the deep foundations of every paper on Avida, I have to agree that any statement recommending, say, Python over Perl, APL, Pascal, or COBOL as a solution is likely to be at best disingenuous and at worst just plain wrong. It is well-known that any Turing-complete language (given infinite memory, yada yada) can emulate any other -- so why choose between them?

Dunno. But, repetitive as it may be to say it, I think a large part of the solution to bad scientific programming is to use a language like Python. Seriously, I'm perfectly aware that Lincoln Stein (and likely Matthew Garrett) can kick my ass when it comes to a mano-y-mano, Perl-y-Python scripting contest. I'm even reasonably confident that Lincoln Stein could take me down in person; he looks mean. (I haven't met Matthew.) But to cite an N of 1 ("worked for me!") as an actual argument... well, I'm no math major but it seems like a large std deviation.

An argument that I might make, were I still slavishly and unreasonably devoted to Perl rather than to Python, would be to point out that anyone writing C extensions for Perl by hand without using SWIG and/or XSAPI probably has bigger problems than over-frequent enjoyment of a little crack. If that's the big problem with Perl, then it's not a problem at all.

This argument ignores the value of writing pseudocode instead of line noise, but that seems to be a personal preference rather than an absolute, for some reason...

And (seriously) Matt's point that this is a social problem is entirely correct. Teaching people Python at an earlier age might help there. ;)

...why "system()" sucks.

But let's move on to a different argument: my savage hatred of "system()". Do an experiment: try writing a parser for the "generic" GFF format. What, you say? That's easy? Sure is -- for each and every one of the bajillion programs that output GFF, it's easy! Now, let's see which field(s) they overloaded this time...

The problem, to put it bluntly, is formats. In information-theoretic terms, stdout is often a very lossy channel, and it is difficult (and often impossible) to make it 100% clean. Why? Well, suppose someone gives you some brilliantly written (and novel) standalone piece of code, and it takes in sequences in FASTA format together with a couple of parameters. Now the program does some fantastically complex set of calculations -- gene finding, HMM search, Gibbs sampling, sequence alignment -- and spits out some text as a result. That's right -- some text. What does the text mean? At this point the hapless user of a novel program has several options. S/he can:

  1. write a one-off parser that grabs the necessary data and runs.
  2. write a complete parser that parses all of the output and puts it into a nice structure for later use.
  3. hope like hell that the author of the program provided a "standard" format like GFF that captures some significant component of the output.
  4. wait for someone more anal retentive (or needier, or smarter, or harder-working) to write a really good parser for the format.

Libraries like BioPerl or BioPython give you #3 and #4 (with time). #2 takes a lot of effort and is only worth it when you really need all of the info in the output. #1 is what everybody does, in practice, right up until it bites 'em in the butt.

There's one huuuuuge problem with all of this, however: you're at the mercy of the author of the package to provide full, honest information in the output. Well, good luck with that, and have a good time rewriting your parser when Joe Package Author decides that semicolons are a better divider than commas...
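The divider problem is easy to demonstrate. Below is a sketch of a typical option-#1 one-off parser and what happens to it when the upstream format shifts; the record layout and both sample lines are invented for illustration and don't come from any real program.

```python
def parse_hits(line, sep=","):
    """One-off parser: expects 'name<sep>start<sep>end<sep>score'."""
    name, start, end, score = line.strip().split(sep)
    return {"name": name, "start": int(start),
            "end": int(end), "score": float(score)}


old_output = "contig42,1031,1048,17.5"  # made-up line from "version 1"
new_output = "contig42;1031;1048;17.5"  # same data, new divider

print(parse_hits(old_output))
try:
    parse_hits(new_output)
except ValueError as exc:
    # The split no longer yields four fields, so unpacking fails.
    print("parser broke: %s" % exc)
```

This is the lucky case, where the parser fails loudly. The nastier one is a format change that still parses but puts the wrong thing in the wrong field.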

It should be obvious that the best solutions above (#2/#4) can only ever be as good as a good embedding of the package in your SLOC (Scripting Language Of Choice). And, far too often, the actual parsing solution isn't that good, and can't be extended without breaking everybody else's parsers. That's why command-line executables with no associated library or embedding will, to a general and somewhat loose approximation, always suck.

So, people: use Python. Or COBOL. And write library functions loosely wrapped in main()s, not deeply embedded spaghetti code.
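For the record, "library functions loosely wrapped in main()s" can be as simple as the following sketch. The task (exact substring search over FASTA records) and all of the names are placeholders chosen for illustration; the shape is what matters.

```python
import sys


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA-format lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_exact(records, query):
    """Library function: return a list of (name, position) hits, not text."""
    hits = []
    for name, seq in records:
        pos = seq.find(query)
        while pos != -1:
            hits.append((name, pos))
            pos = seq.find(query, pos + 1)
    return hits


def main(argv):
    """Thin wrapper: parse arguments, call the library, format the output."""
    if len(argv) != 3:
        sys.stderr.write("usage: %s QUERY FASTA_FILE\n" % argv[0])
        return 1
    with open(argv[2]) as handle:
        for name, pos in find_exact(read_fasta(handle), argv[1]):
            sys.stdout.write("%s\t%d\n" % (name, pos))
    return 0

# A script would end with:
#   if __name__ == "__main__":
#       sys.exit(main(sys.argv))
```

Anyone who wants the command-line tool still gets it; anyone who wants the results as data just imports find_exact and never touches stdout.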


The shoutout today goes to gnutizen, who obviously has his own drug issues; he certified me as "Journeyer"!

p.s. It turns out I was a math major. Huh. Weird.

p.p.s. If someone with some Perl and C/C++ knowledge were to go comment on my SWIG/Perl embedding of motility (see the CVS) it could be most useful to me. Just a thought.

p.p.p.s. In the bioinformatics language wars, I have to say that Bioconductor really takes the cake in the "absurdity" category. I personally like R, but why someone would choose it over a more mainstream language for general-purpose programming <shakes head>...
