The Future of Bioinformatics (in Python), part 1 (b)
My last post initiated a discussion on the biology-in-python mailing list about BioPython, among other things. (Here is a link to the discussion, which is kind of long and unfocused.)
I'm happy that the bip list is serving as a place for people to interact with the BioPython maintainers to discuss the future of BioPython. Hopefully it will lead to more involvement with BioPython, which would be a good thing.
However, I would like to take the time to question the longer-term utility of the BioPython/Perl approach.
Bioinformatics -- by which I mostly mean sequence analysis -- has predominantly followed the UNIX scripting/pipeline model, in which data is kept in simple, easily-manipulated formats (comma- or tab-separated values, or CSV) and then processed incrementally. This approach has a number of advantages:
- Each step is isolated and so easier to understand.
- Each step produces a simple, easy-to-parse kind of data.
- Each step is language neutral (anything can read CSV).
- New programmers can learn to use each step in isolation.
- The components are re-usable.
I've used this exact approach for well over a decade: first for analyzing Avida data, then Earthshine, and most recently scads of genome data.
This scripting & pipeline approach is what BioPython and BioPerl facilitate. They have a lot of tools for running programs to produce data and loading in different formats, and they serve as a good library for this purpose.
The scripting/pipeline approach does have some deficiencies as a general data-analysis approach:
- Poor (O(n)) scalability: processing CSV files is hard to do sub-linearly, since every query means re-reading the whole file, and often the easiest analysis approach is actually O(n**2)
- Hard to test: generally people do not test their scripts. Even now that I've become test infected, I find scripts to be more difficult to test than modules and libraries. I can do it, but it's not natural for me, and empirical evidence suggests that it's not natural for most people.
- Hard to re-use: scripts are often quite fragile with respect to assumptions about input data, and these assumptions are rarely spelled out or asserted within the code. This leads to hard-to-diagnose errors that often occur deep within the tool chain (if they ever show up explicitly).
- Poor metadata support: try attaching metadata to a CSV file. You'll end up with something like GFF3, which overloads the metadata field to mean something slightly different with each database. Awesome.
- Too easy to map into SQL databases: yes, you can load CSV files into SQL databases, but JOINs are a relatively rare form of actual data analysis -- and that's what SQL databases are best at. SQL databases do a particularly poor job of interval analysis (overlap/nearest neighbor extraction/etc.)
- Poor abstraction: when you load something into memory from a CSV file, it's easy to treat it as a list. Lists are, generally, a poor way to interact with sequence annotations. (This is really the same problem as the SQL database problem.)
- Poor user interface: it's hard to put lipstick on a script! People who aren't comfortable with UNIX and file munging (i.e. most biologists) have a hard time using scripts, and it's rather difficult to wrap a script in a GUI or Web site.
- Poor reproducibility: every scientist I know has trouble keeping track of what parameters they used last time they ran a script. Even if they keep track of things in a lab notebook, that's a poor medium for reference; logging and notebook software don't seem to work very well for this, either.
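To make the scalability and interval-analysis complaints concrete, here's a minimal sketch. The feature tuples, the function name, and the IntervalIndex class are all made up for illustration; this is not BioPython or pygr code. The first function is what a script looping over CSV rows does for every query; the second sorts once and uses bisect to skip everything that starts after the query ends.

```python
import bisect

# Hypothetical annotations: (start, end, name), half-open intervals [start, end).
FEATURES = [
    (100, 200, "geneA"),
    (150, 300, "geneB"),
    (400, 500, "geneC"),
    (450, 460, "exonC1"),
]


def overlaps_scan(features, start, end):
    """Test every feature against the query: O(n) per query."""
    return [f for f in features if f[0] < end and f[1] > start]


class IntervalIndex:
    """Sort features by start once, then bisect to prune each query."""

    def __init__(self, features):
        self.features = sorted(features)
        self.starts = [f[0] for f in self.features]

    def overlaps(self, start, end):
        # Features whose start is >= the query end cannot overlap; skip them.
        hi = bisect.bisect_left(self.starts, end)
        # The remaining candidates still need their end checked.
        return [f for f in self.features[:hi] if f[1] > start]


index = IntervalIndex(FEATURES)
hits = index.overlaps(180, 420)
```

Note that the index only prunes on one side, so a query near the end of a chromosome is still linear in the worst case. A real interval tree or similar structure does better; the point is only that you can't even start down that road while the data is "a list of rows".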
These deficiencies didn't bother me too much when I was first interacting with genomic data, but they've become glaringly apparent in the face of massively parallel sequencing data. The advent of 454 and (particularly) Solexa sequencing data, where you can get tens of millions of short reads from a DNA sample, means that scalability concerns dominate; the ready availability of such data means that everyone has some and needs to analyze it, and they want good, fast, correct tools to do so. In the struggle to cope with this data, tools like maq emerge. maq uses a largely opaque intermediate data format to make Solexa data analysis scalable; this ends up being a bit of an intermediate model, where you query and manipulate maq databases from the command line. It can be scripted, but it doesn't have the advantages of language neutrality or easy parse-ability, and so you lose some of the advantages of scripting. Since maq doesn't really work as a programming library, either, you don't gain the benefits of abstraction (it's designed on an extract-transform-load model where you run each command as an isolated operation). There are lots of pieces of bioinformatics software like this: they solve one problem well, but they're not built to output data that can be easily combined with data from another program -- at which point you run into format and scripting issues.
For me, the deficiencies of the scripting model largely come down to the lack of an abstraction layer that separates how the data is stored from how I want to query and manipulate the data. The introduction of a good abstraction layer immediately potentiates re-usability, because now I can separate data loading from data query and start using objects to build queries. It also makes scalability a matter of building a good, general solution once, or perhaps building specific solutions that all look the same at the API level. Once the API is firm, it's relatively straightforward to test; once I can separate the API from implementation I can implement different backend storage and retrieval mechanisms as I like (pickle, SQL, whatever); and I can build a GUI interface without having to change the internals every time I change data storage types or analysis algorithms.
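Here's a rough sketch of what I mean by separating the API from the storage. Everything in it (AnnotationStore, MemoryStore, SQLiteStore, names_near, the table layout) is invented for illustration and is not how pygr or any other library actually does it. The query function at the bottom only knows about the abstract interface, so it runs unchanged against either backend.

```python
import sqlite3
from abc import ABC, abstractmethod


class AnnotationStore(ABC):
    """Hypothetical API: what a query needs, with no mention of storage."""

    @abstractmethod
    def add(self, seq_id, start, end, name):
        """Record an annotation on the half-open interval [start, end)."""

    @abstractmethod
    def overlapping(self, seq_id, start, end):
        """Return sorted names of annotations overlapping [start, end)."""


class MemoryStore(AnnotationStore):
    """Backend 1: a plain Python list."""

    def __init__(self):
        self._rows = []

    def add(self, seq_id, start, end, name):
        self._rows.append((seq_id, start, end, name))

    def overlapping(self, seq_id, start, end):
        return sorted(
            name
            for (sid, lo, hi, name) in self._rows
            if sid == seq_id and lo < end and hi > start
        )


class SQLiteStore(AnnotationStore):
    """Backend 2: an SQLite table with the same behavior."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS ann "
            "(seq_id TEXT, lo INTEGER, hi INTEGER, name TEXT)"
        )

    def add(self, seq_id, start, end, name):
        self._db.execute(
            "INSERT INTO ann VALUES (?, ?, ?, ?)", (seq_id, start, end, name)
        )

    def overlapping(self, seq_id, start, end):
        cur = self._db.execute(
            "SELECT name FROM ann WHERE seq_id = ? AND lo < ? AND hi > ? "
            "ORDER BY name",
            (seq_id, end, start),
        )
        return [row[0] for row in cur]


def names_near(store, seq_id, position, window=50):
    """Analysis code written once, against the API only."""
    return store.overlapping(seq_id, max(0, position - window), position + window)


results = []
for store in (MemoryStore(), SQLiteStore()):
    store.add("chr1", 100, 200, "geneA")
    store.add("chr1", 400, 500, "geneC")
    store.add("chr2", 100, 200, "geneX")
    results.append(names_near(store, "chr1", 220))
```

Because both backends give the same answer for the same calls, one test suite written against AnnotationStore covers every implementation, and swapping in a faster index later doesn't touch names_near.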
On the flip side, once you move into a framework, you now have the problem that you're coding at a level well above most newbie programmers and biologists, so ease-of-use becomes a real issue. This means that people need good documentation and good tutorials, in particular -- the Achilles Heel of open source & academic software. And, of course, the framework has to actually work well and solve problems well enough to reward the casual scientist who needs a tool.
So, with respect to BioPython, I appreciate the functionality it has, but I think the model is wrong for my work (and for work in a world full of genomes and sequence). What I really want is a complete solution stack for sequence analysis and annotation:
data storage
--
object layer
--
scripting layer
--
user interface tools
I'm out of time now, but next installment, I'll talk about how pygr provides much of this "solution stack" for me.
If you're interested in a longer, more detailed version of much the same argument, see Chris Lee's paper with Stott Parker and Michael Gorlick, Evolving from Bioinformatics-in-the-Small to Bioinformatics-in-the-Large.
For a recent overview of pygr's functionality, see the draft paper, Pygr: A Python graph framework for highly scalable comparative genomics and annotation database analysis.
--titus
The Future of Bioinformatics (in Python), part 1 (a)
Chris Lasher wrote a nice blog post naming me as a rabble rouser in the area of "Python in bioinformatics". His post raised a number of interesting points, some of which I'd like to discuss here on my blog.
First, why is Python not more dominant in bioinformatics? I really lay this at the feet of Lincoln Stein, who (from what I can tell) was the dominant force behind BioPerl in the early days. BioPerl worked really well and attracted all sorts of attention and users and actual use. However, I think the tide is shifting away from Perl: from the not-so-imminent release of a complex, backwardsly-incompatible Perl 6, to the massive quantities of completely non-reusable Perl code that have been flung in every direction, people are starting to get sick of Perl. Also, a lot of people in academia are moving towards Python for bioinformatics, if not in a very coordinated way: when I left Caltech, two of the three heavy bioinformatics groups were using Python, and when I arrived at MSU I found several groups doing bioinformatics in Python and only one using Perl (and, at that, mainly because they rely on GMOD).
Heck, there are a lot of Python-in-bio sightings these days. I just went to a talk by Rob Knight, who works on the human microbiome project, and he mentioned developing PyCogent with some collaborators. A lab on campus uses TAMO for motif searching. Cistematic and a variety of tools from the Wold Lab use Python. James Taylor is working hard on developing Galaxy into a general purpose tool. So I don't despair for Python's presence in biology.
I think the world is moving, medium-to-long-term, towards the use of Perl for scripting-level work, Python for frameworks and re-usable software, and R for statistical analysis of data sets (BioConductor is also popping up a lot these days). Personally I think this is the right approach and bodes well in the long term.
--
Second, Chris says,
I think I have not worked with Biopython because I am not encouraged to do so, and am actually discouraged, because of research, and the current culture of academia.
I, too, am struggling with the problem that research scientists, somewhat shockingly, are more interested in doing (and funding) novel research than in building re-usable software. OK, I'm being a bit sarcastic, but that's only a mildly sarcastic statement, really; while it's understandable that researchers want to do research, the rise of large-scale data and computational methods in biology unambiguously argues for computational competence in the next generation of researchers. Part of computational competence is knowing how to get stuff done effectively and correctly, not to mention with reusable software when possible. I am actually shocked that there's so little focus on Software Carpentry-like skills in science and education, and I'm doing my best to push on that front here at MSU (see my very first course here, which is introducing Python, Subversion and automated testing to CSE undergrads).
That computing in biology sucks is not by any means a novel observation; see this nice article, Computational Biology Resources Lack Persistence and Usability, for example. My take on things is that the funding bodies simply need to recognize the utility of software maintenance, which is slowly happening, and that the undergrad and graduate departments need to adapt to the future by teaching this stuff. But there's no question it's going to be Darwinian out there -- as Stewart Brand says, "Once a new technology starts rolling, if you're not part of the steamroller, you're part of the road." Hopefully some of us can be the steamroller and not the road, yeh?
So what's my solution, you might ask? Well, now that I'm a bigshot professor, I'm going to be encouraging (well, demanding where possible) that my students and collaborators use good software development techniques and release their source code and data. But my real "secret" -- and please steal it if you can :) -- is that I hope to continue building a real infrastructure that can underlie solutions to my various research problems. If I can build a re-usable core of well-tested tools on top of a solid framework, I should be able to do research faster, better, smarter, and more reliably than my colleagues and competitors. That should translate into more publications, more grants, and more problems actually solved. (I'll let you know how that goes; it's early days still.)
That, incidentally, is why you should ignore people who tell you not to work on your coding or on general-purpose libraries: because if it's useful to you, it's worth doing right and making it useful to yourself in the future.
This is also one of the reasons why I'm investing a substantial amount of my scarcest resource (time) in pygr. pygr is a solution for scalable storage, retrieval, and named persistence of sequence-associated data, and it works fantastically well. The real problem with pygr is the high barrier to entry, and that's what we're working on lowering, if only so that my own students will have less trouble learning it.
Some other time I'll talk about why pygr and pygr-like solutions are the right solution to reusability in bioinformatics.
So, in summary: don't worry, be happy, Python is coming to bioinformatics one way or another. And don't worry, just work hard at becoming the steamroller (and not the road) by improving your coding skills and becoming a general-purpose computational scientist, or at least general-purpose bioinformatician. You won't regret it.
Heck, you can always come work for me, right? ;)
--titus
Position: Assistant Professor/Comparative Genomics
We have an opening for a project on which I'm collaborating:
Full-time 12 month appointment academic position for a genomics scientist. The incumbent will spend 50% time as the Associate Director of the Comparative Genomics Laboratory, with duties in directing daily activities, long-range planning and seeking extramural funding, and 50% time to conduct research. Qualified candidates should have an earned Ph.D, training in Bioinformatics or Quantitative Biology and strong interests in evolution of early vertebrate genomes and functional genomics. Postdoctoral experience preferred. Initial appointment will be for three years as a fixed-term assistant professor in the Department of Fisheries and Wildlife at Michigan State University, with possible renewal. Email a single PDF or word file containing cover letter, CV and statement of research interests to liweim<at>msu.edu before 30 Sept. 2008.
Note that this is a non-tenure-track assistant professorship, so it's fixed term and soft money. We don't have positions like this in my departments (comp sci & molecular biology) but you could think of this as a very senior, very well paid, and very independent postdoc position.
--titus
Python for Intro CS?
I'm surprised I haven't seen this on planetpython yet...
...an emerging consensus in the scripting community holds that Python is the right solution for freshman programming. Ruby would also be a defensible choice.
(emphasis mine). Originally found via Lambda the Ultimate, and also passed onto me by Rich Enbody.
In other news, there are some rumors coming out of the intro CS course (CSE 231/232) here at Michigan State that the switch to Python from C++ for the first term, 231, didn't affect the students' performance in the follow-on course, 232. That is, students performed equally well on the 232 final independently of whether or not they'd had Python or C++ in 231. I had hoped for an improvement in the scores, but at least it's not a decline!
--titus
SciFoo - am I just jealous?
I read things like this report on SciFoo and think, gawd! I'd have had a great time! I should try to beg/bully/buy/brown-nose my way into the next SciFoo so I can talk about Science 2.0 etc.!
And then I think back to the heady days of ALife when all that stuff was pretty new, and wacky ideas were being proposed, and the conference gatherings were crazy interesting and fun, and realize that -- apart from a lot of great connections, a few publications, and a wife -- I didn't get much of lasting import from that whole ALife thing. What really added to my life, long-term, from that period was execution ability. I wrote some code, did some research, and ran conferences; those have all stuck with me. The people-talking and socializing didn't stick except in so far as it led to interesting research (well, and a wife, but I'm not looking for another one of those).
My assessment is that I really just need to buckle down and produce over the next few years. This Science 2.0 stuff will come and go, and I'll adjust as I need to; but since it's unlikely to offer me a revolutionary way of doing science, I'm better off doing good science first and only then worrying about socializing. (YMMV, esp if you're Mike Eisen. :)
Also, looking at my work schedule for the last few weeks (talking with students about their projects; ordering stuff for my lab; discussing research with my postdoc; and generally getting shit together) it's hard to argue that there are more important things for me to be doing than that, at least in the academic sphere.
To put things another way: talk is cheap. Action speaks louder than words. Ideas multiply execution.
Or perhaps I'm just bitter that I didn't get invited. It sounds like fun!
--titus
A reply to Elanthis: Python Annoyances
In reply to elanthis's post on Advogato,
1. I agree that the documentation could be improved, and we've been working on it. The next release should add a whole bunch of examples. Google is your friend, as is the Python Cookbook.
foo/some/other/package/blah.py:
    class MyClass:
        ...
foo/__init__.py:
    from foo.some.other.package.blah import MyClass
The physical layout is for you, the developer, while the exposed package interface can be pretty much whatever you want.
3. Classes are weak: yes, I guess so, but I don't really know how to address your concerns without adding a lot of syntax. Are you just lusting after variable declarations 'cause that's how you think?
Your take on unit testing seems just plain wrong. I know of no useful language that can prevent the majority of programming errors without some form of actually running the code, a.k.a. "testing". You might think YMMV, but you're almost certainly wrong.
5. No variable declaration: you'd catch most of these problems with even the most rudimentary of unit tests and code coverage analysis. Shadowing is a concern, though. In practice it's never caused problems for me.
6. There are official recommendations regarding docstrings; see PEP 257. The (new) Python documentation is formatted with Sphinx, which you might like better.
I think you have a good point or two, but I also think you need to spend some more time programming in Python to figure out which of your complaints are actual problems with Python and what is simply a legacy of bad habits garnered from experiences with other languages. Even if you abandon Python for another dynamic language, I think you'll have the same (or stronger) criticisms of those.
cheers, --titus
Helping Python
Recently the question came up: suppose you wanted to give enthusiastic people some guidance on how to help work on Python. What suggestions do you have? Surely there's a Web page on this!
Well, no: a few quick Google searches led me to discover that "contributing to Python" was answered neither succinctly nor well in any of the resulting pages.
So I thought, heck! Let's create a wiki page!
http://wiki.python.org/moin/HelpingPython
Check it over and add in your own ideas.
--titus
The Fragile Light
Just finished the book The Fragile Light, by David Nurenberg. Good stuff; independent author. Worth reading.
Briefly, it's a SF&F novel about a world where mutants are sometimes heroes, and more often feared; where there are Herotown ghettoes full of supers; and where only licensed heroes can join in the game. It reminds me a bit of the early Wild Cards books, but better written. A fun read.
--titus
zounds, for running lots of BLASTs
I finally got sick of manually schlepping BLAST files around, so I wrote something to do it for me. 'zounds' is a very simple server/client system for coordinating a bunch of 'worker' nodes through a central server; it does everything in Python with objects and pickling, so it's easy to do extra Python-based processing on the worker nodes. See 'filters' for more info.
You can read a bit more about zounds here:
http://iorich.caltech.edu/~t/zounds/README.html
It's freely available, open-source, etc. etc.
Comments and thoughts welcome; send them to the bip list.
--titus