Hacking your genome

Posted 23 Jul 2000 at 18:14 UTC by ewan

Are you a hacker? Do you yearn for something more important to work on than yet-another-gnome-applet? Are you annoyed that you can't find a problem that is fun to code and stretches your brain in new ways? Bioinformatics might be the answer.

A brave new world

Bioinformatics is a weird new science between biology and computer science. Like oil and water, these two sciences have studiously kept apart for the last couple of decades; unfortunately, they are now being thrown together by the advances in automated data gathering in molecular biology.

The best example of this is determining the DNA sequence of an organism's genome (the genome is the sum of the information passed from parent to child). This has become so automated that genomes the size of the human genome can be tackled; the result is a DNA database the size of a national library, and still growing exponentially. And although we have this data stored, it is still like reading Mayan glyphs. It is clearly important, but we generally can't figure out what it says.

Hackers wanted

On Slashdot, the advances by the commercial companies in genomics are touted as news; and yet they ignore the public effort that underpins this development and consider it beneath them to even mention that free software is the thing driving the genome projects. Have you used CGI.pm, the commonest way to link Perl up to the Web? It was written by Lincoln Stein, a genomics researcher, as was the original GD module for making GIFs, with the underlying gd C library also written by researchers in genomics. Free databases were a necessity in this field long before postgres and mysql found their feet.

But we still need people. The amount of data is growing faster than anyone expected, and only a handful of people combine academic ideals with coding ability. We need hackers to join any number of projects out there, and there is a host of them to join. Whether you just like hacking Perl or you prefer compiler technology, there is something to suit you.

Most wanted

Many people will be getting together at BOSC, the Bioinformatics Open Source Conference.

Here is a list of potential projects to join up to in the free software world of bioinformatics. Just think - your source code will be hacked and isn't it better that it gets hacked by free (as in speech) software?

(NB: people should be aware that I am leader for both the Ensembl and bioperl projects, and wrote both Wise2 and Dynamite. In fact, I am somewhat involved in all these projects. This is therefore a pretty biased view of the field.)

    The whole genome
  • Ensembl Built on top of bioperl, delivering a free (in every sense) genome to the world.
  • ACeDB A free database before they were trendy. An object-based database designed for biology.

    The Bio* projects are cohosted and loosely organised

  • bioperl The oldest free software project of the set. Mature and crufty; in need of new blood
  • biopython The clean-object scripting language which is coming up to its first release
  • biojava The object orientated bio project, also close to its first release
  • bioxml Trying to provide data-neutral structures for biology. In its infancy.
  • Biocorba (no web site, coordinated through bioperl at the moment). Trying to provide inter-language components to allow the projects to leverage off each other

    Graphical viewers

  • Artemis a bacterial genome curation tool
  • Apollo A eukaryotic genome curation tool
  • DAS Distributed annotation, the way to deliver the genome to the biologist


  • EMBOSS a C library and programs for bioinformatics
  • HMMER The premier protein hidden Markov model package
  • Wise2 Another hidden Markov modeling package.
  • Dynamite and Telegraph A language for expressing probabilistic finite state machines - Telegraph is the new project rising out of the destruction of Dynamite.


  • bioinformatics.org provides co-hosting of free software projects.

Please join!

If anyone in the know thinks I have missed something out, please post below!

Good going, posted 23 Jul 2000 at 19:08 UTC by penguin42 » (Journeyer)

I started looking at some of the genome sites a few months ago; perhaps the most effectively interlinked databases I've ever seen - extremely clever.

I wish I understood more at the genetic level - is there a 'genetic code hacking for computer geeks'?

There is some clever stuff in these old genes and I wish I understood it.

(and could do with applying some minor patches here and there....)

genetic code for computer geeks, posted 24 Jul 2000 at 21:52 UTC by lorenz » (Observer)

i don't know if there is such a document but i can tell you some resources that have been very useful to me when i started with this bioinformatics stuff...

first there is this page with lots of tutorials and manuals on how to use and search the genome databases that are available on the internet.
and you could try looking at the National Center for Biotechnology Information (NCBI), because they have really good tutorials lying around there (regarding databases and search algorithms)

hope this helps a bit.

Definitely the wave of the future, posted 25 Jul 2000 at 16:23 UTC by ribozyme » (Apprentice)

This is a great thread! Of course I'm somewhat biased given my background, but I think that there's no reason why anyone with a bit of curiosity and good old-fashioned intellectual playfulness can't hack on bioinformatics algorithms (especially ones dealing with DNA or protein sequence information, since this can easily be abstracted into a computational problem solving task; some of the writings of Douglas Hofstadter are a perfect example of this). In this spirit, Greg Egan's science fiction novel Permutation City portrays a near-future in which people play around with a popular computer simulation called 'The Autoverse', which is a completely synthetic universe right down to cellular processes and atomic interactions.
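To make the "sequence as a computational problem" point concrete, here is a minimal Python sketch that treats DNA as an ordinary string. The helper names and the example sequence are made up for illustration and are not taken from bioperl, biopython or any other project mentioned in this thread.

```python
# Hypothetical helpers for illustration only; plain standard-library Python.

# Translation table mapping each base to its complementary base.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read in the conventional direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_content(seq):
    """Return the fraction of bases that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_all(seq, motif):
    """Return the 0-based start of every occurrence of motif, overlaps included."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


if __name__ == "__main__":
    dna = "ATGGCCATTGTAATGGGCCGCTGA"  # made-up example sequence
    print(reverse_complement(dna))
    print(gc_content(dna))
    print(find_all(dna, "ATG"))
```

Real analysis tools do far more than this, but the starting point really is string manipulation of this kind.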

At the research institute where I work, we have some very gifted programmers with very little scientific background producing some very interesting database-driven programs for high-throughput PCR product analysis. I've spoken with them and they find it a refreshing change from their everyday programming. The popularity of initiatives like SETI@home (especially in the Linux community) indicates that there is a widespread interest in the intersection between science and computers.

I just wanted to point out that Bioinformatics goes beyond programs that analyse DNA/protein sequences. There are emerging fields known as structural genomics and proteomics which are offshoots of the genome projects. There is going to be a huge need for computational analysis of protein expression patterns and potentials for protein-protein interactions within a cell. This sort of work would require a more hardcore scientific background, but there is going to be a real demand for people in these fields. My interests lie in what's beginning to be referred to as 'functional proteomics', which builds on genomics, proteomics, and structural genomics in an attempt to model a virtual cell; the E-Cell Project is a good example of this.

Exciting times, indeed!

Relational Databases and Data Models, posted 7 Aug 2000 at 20:31 UTC by bwtaylor » (Journeyer)

I'm an Oracle database guy by profession who's been dabbling in bioinformatics stuff. I got interested in it because my good friend is getting his PhD in the field, and it's obviously "hot".

One of the things that has astounded me is the complete lack of standards regarding how data is shared. Many (most?) of the data sources don't even use relational database systems. It seems rather obvious to me that this phenomenon is seriously holding the field back, although I have no idea if this perception is shared inside the field. It seems like there is a desperate need to standardize some data models, since there are gigabytes, even terabytes of data pouring in.
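As a rough idea of what a shared relational model could look like at its smallest, here is a sketch using Python's built-in sqlite3 module. The table names, column names and the demo record are all hypothetical; they are not drawn from any existing standard or from any of the databases mentioned in this thread.

```python
import sqlite3

# Hypothetical two-table model: sequences, plus annotated features located on them.
SCHEMA = """
CREATE TABLE sequence (
    id        INTEGER PRIMARY KEY,
    accession TEXT NOT NULL UNIQUE,
    organism  TEXT NOT NULL,
    residues  TEXT NOT NULL
);
CREATE TABLE feature (
    id          INTEGER PRIMARY KEY,
    sequence_id INTEGER NOT NULL REFERENCES sequence(id),
    kind        TEXT NOT NULL,
    start_pos   INTEGER NOT NULL,
    end_pos     INTEGER NOT NULL,
    CHECK (start_pos <= end_pos)
);
"""


def build_demo_db():
    """Create an in-memory database holding one made-up sequence and two features."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    cur = conn.execute(
        "INSERT INTO sequence (accession, organism, residues) VALUES (?, ?, ?)",
        ("DEMO0001", "Example organism", "ATGGCCATTGTAATGGGCCGCTGA"),
    )
    seq_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO feature (sequence_id, kind, start_pos, end_pos) "
        "VALUES (?, ?, ?, ?)",
        [(seq_id, "gene", 1, 24), (seq_id, "exon", 1, 12)],
    )
    conn.commit()
    return conn


def features_for(conn, accession):
    """Return (kind, start_pos, end_pos) rows for one accession, ordered by position."""
    rows = conn.execute(
        "SELECT f.kind, f.start_pos, f.end_pos "
        "FROM feature AS f JOIN sequence AS s ON s.id = f.sequence_id "
        "WHERE s.accession = ? ORDER BY f.start_pos, f.end_pos",
        (accession,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    print(features_for(build_demo_db(), "DEMO0001"))
```

The point is only that once data sources agree on even this much structure, a join like the one in features_for works the same way everywhere.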

I suspect that there is a little bit of culture shock going on as biologists are being dragged by necessity towards the more analytical fields of information technology and programming. Certainly there is a critical shortage of hacker biologists. I remember my undergrad days when the math-physics-CS types and the biologist types were definitely different religions, so to speak.

Intellectual Property Concerns, posted 7 Aug 2000 at 20:40 UTC by bwtaylor » (Journeyer)

Intellectual property issues are extremely troubling to me in bioinformatics. Open source programmers would probably puke at the idea that their contributions could feed the industry quest to ID and patent genes. There should be some way to licence software so that discoveries made using it remain "free", but this raises equally troubling problems.

The basic idea of open source is to retain the IP (copyright) interest and use it proactively to turbo-charge the public domain in terms of sharing development ideas, but traditionally the use of the software isn't constrained. In biology, however, the "development" isn't just confined to software.

Has anybody out there thought much about this?

Bioinformatics libraries and articles, posted 24 Aug 2000 at 11:05 UTC by welisc » (Journeyer)

The CCP11 site has an annotated list of Bioinformatics Programming Libraries.

bioinformatics.org hosts bioinformatics projects and carries submitted links to news about the field.

The current issue of the MIT Tech Review has an article on structural genomics. Links to other articles on bioinformatics can be found on my Geocities page.
