Older blog entries for titus (starting at number 463)

Top 12 reasons you know you are a Big Data biologist

A few people have recently asked me what this "Big Data" thing is in biology. It's not an easy question to answer, though, because biology's a bit peculiar, and a lot of Big Data researchers are not working in bio. While I was thinking about this I kept on coming up with anecdotes -- and, well, that turned into this: the Top 12 Reasons You Know You Are a Big Data Biologist.

---

  1. You no longer get enthused by the prospect of more data.

I was at a conference a few months back, and a British colleague of mine rushed up to me and said, "Hey! We just got an Illumina HiSeq! Do you have anything you want sequenced?" My immediate, visceral response was "Hell no! We can't even deal with a 10th of the sequence we've already got! PLEASE don't send me any more! Can I PAY you to not send me any more?"

  2. Analysis is the bottleneck.

I'm dangerously close to genomics bingo here, but:

I was chatting with a colleague in the hallway here at MSU, pitching some ideas about microbiome work, and he said, "What a good idea! I have some samples here that I'd like to sequence that could help with that." I responded, "How about we stop producing data and think about the analysis?" He seemed only mildly offended -- I have a rep, you see -- but biology, as a field, is wholly unprepared to go from data to analyses.

My lab has finally turned the corner from "hey, let's generate more data!" to "cool, data that we can analyze and build hypotheses from!" -- yeah, we're probably late bloomers, but after about 3 years of tech development, there's very little we can't do at a basic sequence level. Analysis at that level is no longer the bottleneck. Next step? Trying to do some actual biology! But we are working in a few very specific areas, and I think the whole field needs to undergo this kind of shift.

  3. You've finally learned that 'for' loops aren't all they're cracked up to be.

This was one of my early lessons. Someone had dumped a few 10s of millions of reads on me -- a mere gigabyte or so of data -- and I was looking for patterns in the reads. I sat down to write my usual opening lines of Python to look at the data: "for sequence in dataset:" and ... just stopped. The data was simply too big to iterate over in a scripting language. Cleverness of some sort needed to be applied.
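
(To make that concrete, here's the sort of throwaway scan I'm talking about, with a hypothetical 'reads.fa' file. It's correct, it's obvious, and at tens of millions of reads times dozens of k-mers per read, the pure-Python inner loop is hopeless.)

K = 32
n_kmers = 0
for line in open('reads.fa'):               # hypothetical FASTA file of reads
    if line.startswith('>'):                # skip header lines
        continue
    seq = line.strip()
    for start in range(len(seq) - K + 1):   # ~40 windows per 70-mer read
        kmer = seq[start:start + K]
        n_kmers += 1                        # even a trivial body is too slow here
print('saw %d k-mers' % n_kmers)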

Corollary: not just Excel, but BLAST, start to look like Really Bad Ideas That Don't Scale. Sean Eddy's HMMER 3 FTW...

  4. Addressing small questions just isn't that interesting.

Let's face it, if you've just spent a few $k generating dozens of gigabytes of hypothesis-neutral data on gene expression from (say) chick, the end goal of generating a list of genes that are differentially regulated is just not that exciting. (Frankly, you probably could have done it with qPCR, if you didn't want the cachet of "mRNAseq!") What more can you do with the data?

(This point comes courtesy of Jim Tiedje, who says (I paraphrase): "The problem for assistant professors these days is that many of them may not be thinking big enough." Also see: Don't be afraid of failure: really go for it)

The biggest training challenge (in my opinion) is going to be training people in how to push past the obvious analysis of the data and go for deeper integrative insight. This will require training not just in biology but in data analysis, computational thinking, significant amounts of informed skepticism, etc. (See our course.) I think about it like this: generating hypotheses from large amounts of data isn't that interesting -- I can do that with publicly available data sets without spending any money! Constraining the space of hypotheses with big data sets is far more interesting, because it gives you the space of hypotheses that aren't ruled out; and it's putting your data to good use. I'll, uh, let you know if I ever figure out how to do this myself...

I think there are plenty of people that can learn to do this, but as Greg Wilson correctly points out, there has to be a tradeoff: what do you take out of existing training curricula, to be replaced with training in data analysis? I wish I knew.

  5. Transferring data around is getting more than a bit tedious.

For some recent papers, I had to copy some big files from EC2 over to my lab computer, and from there to our HPC system. It was Slow, in the "time for lunch" sense. And these were small test data sets, compressed. Transferring our big data sets around is getting tedious. Luckily we have a lot of them, so I can usually work on analysis of one while another is transferring.

  6. Your sysadmin/HPC administrator yells at you on a regular basis about your disk usage.

We regularly get nastygrams from our local sysadmins accusing us of using too many terabytes of disk space. This is in contrast to the good ol' days of physics (which is where I got my sysadmin chops), where your sysadmin would yell at you for using too much CPU...

  7. You regularly crash big memory machines.

My favorite quote of the year so far is from the GAGE paper, in which Salzberg et al (2012) say "For larger genomes, the choice of assemblers is often limited to those that will run without crashing." Basically, they took a reasonably big computer, threw some big data at various assembly packages, and watched the computer melt down. Repeatedly.

Someone recently sent me an e-mail saying "hey, we did it! we took 3 Gb of sequence data from soil and assembled it in only 1 week in 3 TB of RAM!" Pretty cool stuff -- but consider that at least four to six runs would need to be done (parameter sweep!), and it takes only about 1 week and $10k to generate twice that data. In the long run, this does not seem cost effective. (It currently takes us 1-2 months to analyze this data in 300 GB of RAM. I'm not saying we have the answer. ;)

  8. Throwing away data looks better and better.

I made a kind of offhanded comment to a NY Times reporter once (hint: don't do this) about how at some point we're going to need to start throwing away data. He put it as the ultimate quote in his article. People laughed at me for it. (BUT I WAS RIGHT! HISTORY WILL SHOW!)

But seriously, if someone came up to you and said "we can get rid of 90% of your data for you and give you an answer that's just as good", many biologists would have an instant negative response. But I think there's a ground truth in there: a lot of Big Data is noise. If you can figure out how to get rid of it... why wouldn't you? This is an interesting shift in thinking from the "every data point is precious and special" that you adopt when it takes you a !#%!#$ week to generate each data point.

I attended a talk that David Haussler gave at Caltech recently. He was talking about how eventually we would need to resequence millions of individual cancer cells to look for linked sets of mutations. At 50-300 Gb of sequence per cell, that's a lot of data. But most of that data is going to be uninteresting -- wouldn't it be great if we could pick out the interesting bits and then throw the rest away? It would certainly help with data archiving and analysis...

  9. Big computer companies call you because they're curious about why you're buying such big computers.

True story: a Big Computer Company called up our local HPC to ask why we were buying so many bigmem machines. They said "It's the damned biologists -- they keep on wanting more memory. Why? We don't know - we suspect the software's badly written, but can't tell. Why don't you talk to Titus? He pretends to understand this stuff." I don't think it's weird to get calls trying to sell me stuff -- but it is a bit weird to have our local HPC's buying habits be so out of character, due to work that I and others are doing, that Big Computer Companies notice.

(Note: the software's not mine, and it's not badly written, either.)

  10. Your choice must increasingly be "improve algorithms" rather than "buy bigger computers".

I've been banging this drum for a while. Sequencing capacity is outpacing Moore's Law, and so we need to rethink algorithms. An algorithm that was O(N log N) used to be good enough; now, if analysis requires a supra-linear algorithm, we need to figure out how to make it linear. (Sublinear would be better.)

Anecdote: we developed a nifty data structure for attacking metagenome assembly (see: http://arxiv.org/abs/1112.4193). It scaled (scales) assembly by a factor of about 20x, which got us pretty excited -- that meant we could in theory assemble things like MetaHIT and rumen on commodity hardware without doing abundance filtering. Literally the day that we told our collaborators we had it working, they dumped 10x more data on us and told us that they could send us more any time we wanted. (...and that, boys and girls, was our introduction to the HiSeq!) 20x wasn't enough. Sigh.

The MG-RAST folk have told me a similar story. They did some awesomely cool engineering and got their pipeline running about 100x faster. That'll hold them for a year or so against the tidal wave of data.

Corollary: don't waste your time with 2% improvements in sensitivity and specificity unless you also deliver 10x in compute performance.

  11. You spend a lot of time worrying about biased noise, cross-validation, and incorrect statistical models.

We were delayed in some of our research by about a year, because of some systematic biases being placed in our sequencing data by Illumina. Figuring out that these non-biological features were there took about two months; figuring out how to remove them robustly took another 6 months; and then making sure that removing them didn't screw up the actual biological signal took another four months.

This is a fairly typical story from people who do a lot of data analysis. We developed a variety of cross-validation techniques and ways of intuiting whether or not something was "real" or "noise", and we spent a certain amount of time discussing what statistical approaches to use to assess significance. In the end we more or less gave up and pointed out that on simulated data what we were doing didn't screw things up.

  12. Silicon Valley wants to hire your students to go work on non-biology problems.

Hey, it's all Big Data, right?

---

So: what is Big Data in biology?

First, I've talked mostly about DNA sequence analysis, because that's what I work on. But I know that proteomics and image analysis people are facing similar dilemmas. So it's not just sequence data.

Second, compute technology is constantly improving. So I think we need moving definitions.

Here are some more serious points that I think bear on what, exactly, the problems are for Big Data in biology. (They're not all specific to biology, but I can defend them on my home ground more easily, you see.)

  1. You have archival issues on a large scale.

You have lots of homogeneously formatted data that probably contains answers you don't know you're looking for yet, so you need to save it, metadata it, and catalog it. For a long time.

  2. The rate at which data is arriving is itself increasing.

You aren't just getting one data set. You're getting dozens (or hundreds) this year. And you'll get more than that next year.

One implication of this is that you'd better have a largely automated analysis pipeline, or else you will need an increasing number of people just to work with the data, much less do anything interesting. Another implication is that software reuse becomes increasingly important: if you're building custom software for each data set, you will fall behind. A third implication is that you need a long-term plan for scaling your compute capacity.

  3. Data structure and algorithm research is increasingly needed.

You cannot rely on many heavyweight iterations over your data, or simple data structures for lookup: the data is just too big and existing algorithms are tailored to smaller data. For example, BLAST works fine for a few gigabytes of data; past that, it becomes prohibitively slow.

  4. Infrastructure designers are needed.

Issues of straightforward data transfer, network partitioning, and bus bandwidth start to come to the forefront. Bottleneck analysis needs to be done regularly. In the past, you could get away with "good enough", but as throughput continues to increase, bottlenecks will need to be tackled on a regular basis. For this, you need a person who is immersed in your problems on a regular basis; they are hard to find and hard to keep.

One interesting implication here is for cloud computing, where smart people set up a menu of infrastructure options and you can tailor your software to those options. So far I like the idea, but I'm told by aficionados that (for example) Amazon still falls short.

  5. You have specialized hardware needs.

Sort of a corollary of the above: what kind of analyses do you need to do? And what's the hardware bottleneck? That's where you'll get the most benefit from focused hardware investment.

  6. Hardware, infrastructure design, and algorithms all need to work together.

Again, a corollary of the above, but: if your bottleneck is memory, focus on memory improvements. If your bottleneck is disk I/O, focus on hardware speed and caching. If your bottleneck is data transfer, try to bring your compute to your data.

  7. Software needs to change to be reusable and portable.

Robust, reusable software platforms are needed, with good execution guarantees; that way you have a foundation to build on. This software needs to be flexible (practically speaking, scriptable in a language like Python or Ruby or Perl), well developed and tested, and should fade into the background so that you can focus on more interesting things like your actual analysis. It should also be portable so that you can "scale out" -- bring the compute to your data, rather than vice versa. This is where Hadoop and Pig and other such approaches fit now, and where we seriously need to build software infrastructure in biology.

  8. Analysis thinking needs to change.

Comprehensively analyzing your data sets is tough when your data sets are really big and noisy. Extracting significant signals from them is potentially much easier, and some approaches and algorithms for doing this in biology exist or are being developed (see especially Lior Pachter's eXpress). But this is a real shift in algorithmic thinking, and it's also a real shift in scientific thinking, because you're no longer trying to understand the entire data set -- you're trying to focus on the bits that might be interesting.

  9. Analyses are increasingly integrative.

It's hard to make sense of lots of data on its own: you need to link it in to other data sets. Data standards and software interoperability and "standard" software pipelines are incredibly important for doing this.

  10. The interesting problems are still discipline-specific.

There are many people working on Big Data, and there is big business in generic solutions. There's lots of Open Source stuff going on, too. Don't reinvent those wheels; figure out how to connect them to your biology, and then focus on the bits that are interesting to you and important for your science.

  11. New machine learning, data mining, and statistical models need to be developed for data-intensive biological science.

As data volume increases, and integrative hypothesis development proceeds, we need to figure out how to assess the quality and significance of hypotheses. Right now, a lot of people throw their data at several programs, pick their favorite answer, and then recite the result as if it's correct. Since often they will have tried out many programs, this presents an obvious multiple testing problem. And, since users are generally putting in data that violates one or more of the precepts of the program developers, the results may not be applicable.
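
(A quick back-of-the-envelope illustration of the multiple testing point -- my numbers, purely for illustration: if each program has a nominal 5% chance of handing you a spurious "significant" result, then trying several programs and keeping the best answer inflates that chance quickly.)

# family-wise error rate after trying N programs, each at nominal alpha = 0.05
alpha = 0.05
for n_programs in (1, 3, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** n_programs
    print('%2d programs -> %2.0f%% chance of at least one spurious hit'
          % (n_programs, 100 * fwer))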

  12. A lack of fear of computational approaches is a competitive advantage.

The ability to approach computational analyses as just another black box upon which controls must be placed is essential. Even if you can open up the black box and understand what's inside, it needs to be evaluated not on what you think it's doing but on what it's actually doing at the black box level. If there's one thing I try to teach to students, it's to engage with the unknown without fear, so that they can think critically about new approaches.

---

Well, that's it for now. I'd be interested in hearing about what other people think I've missed. And, while I'm at it, a hat tip to Erich Schwarz, Greg Wilson, and Adina Howe for reading and commenting on a draft.

--titus

Syndicated 2012-03-06 22:28:01 from Titus Brown

Data Intensive Science, and Workflows

I'm writing this on my way back from Stockholm, where I attended a workshop on the 4th Paradigm. This is the idea (so named by Jim Gray, I gather?) that data-intensive science is a distinct paradigm from the first three paradigms of scientific investigation -- theory, experiment, and simulation. I was invited to attend as a token biologist -- someone in biology who works with large scale data, and thinks about large scale data, but isn't necessarily only devoted to dealing with large scale data.

The workshop was pretty interesting. It was run by Microsoft, who invited me & paid my way out there. The idea was to identify areas of opportunity in which Microsoft could make investments to help out scientists and develop new perspectives on the future of eScience. To do that, we played a game that I'll call the "anchor game", where we divvied up into groups to discuss the blocks to our work that stemmed from algorithms and tools, data, process and workflows, and social/organizational aspects. In each group we put together sticky notes with our "complaints" and then ranked them by how big of an anchor they were on us -- "deep" sticky notes held us back more than shallow sticky notes. We then reorganized by discipline, and put together an end-to-end workflow that we hoped in 5 years would be possible, and then finally we looked for short- and medium-term projects that would help get us there.

The big surprise for me in all of this was that it turns out I'm most interested in workflows and process! All of my sticky notes had the same theme: it wasn't tools, or data management, or social aspects that were causing me problems, but rather the development of appropriate workflows for scientific investigation. Very weird, and not what I would have predicted from the outset.

Two questions came up for me during the workshop:

Why don't people use workflows in bioinformatics?

The first question that comes to mind is, why doesn't anyone I know use a formal workflow engine in bioinformatics? I know they exist (Taverna, for one, has a bunch of bioinformatics workflows); I'm reasonably sure they would be useful; but there seems to be some block against using them!

What would the ideal workflow situation be?

During the anchor game, our group (which consisted of two biologists, myself and Hugh Shanahan; a physicist, Geoffrey Fox; and a few computer scientists, including Alex Wade) came up with an idea for a specific tool. The tool would be a bridge between Datascope for biologists and a workflow engine. The essential idea is to combine data exploration with audit trail recording, which could then be hardened into a workflow template and re-used.

---

Thinking about the process I usually use when working on a new problem, it tends to consist of all these activities mixed together:

  1. evaluation of data quality, and application of "computational" controls
  2. exploration of various data manipulation steps, looking for statistical signal
  3. solidifying of various subcomponents of the data manipulation steps into scripted actions
  4. deployment of the entire thing on multiple data sets

Now, I'm never quite done -- new data types and questions always come up, and there are always components to tweak. But portions of the workflow become pretty solid by the time I'm halfway done with the initial project, and the evaluation of data quality accretes more steps but rarely loses one. So it could be quite useful to be able to take a step back and say, "yeah, these steps? wrap 'em up and put 'em into a workflow, I'm done." Except that I also want to be able to edit and change them in the future. And I'd like to be able to post the results along with the precise steps taken to generate them. And (as long as I'm greedy) I would like to work at the command line, while I know that others would like to be able to work in a notebook or graphical format. And I'd like to be able to "scale out" the computation by bringing the compute to the data.

For all of this I need three things: I need workflow agility, I need workflow versioning, and I need workflow tracking. And this all needs to sit on top of a workflow component model that lets me run the components of the workflow wherever the data is.

I'm guessing no tool out there does this, although I know other people are thinking this way, so maybe I'm wrong. The Microsoft folk didn't know of any, though, and they seemed pretty well informed in this area :).

The devil's choice I personally make in all of this is to go for workflow agility, and ignore versioning and tracking and the component model, by scripting the hell out of things. But this is getting old, and as I get older and have to teach my wicked ways to grad students and postdocs, the lack of versioning and tracking and easy scaling out gets more and more obvious. And now that I'm trying to actually teach computational science to biologists, it's simply appallingly difficult to convey this stuff in a sensible way. So I'm looking for something better.

One particularly intriguing type of software I've been looking at recently is the "interactive Web notebook" -- ipython notebook and sagenb. These are essentially Mathematica or matlab-style notebook packages that work with IPython or Sage, scientific computing and mathematical computing systems in Python. They let you run Python code interactively, and colocate it with its output (including graphics); the notebooks can then be saved and loaded and re-run. I'm thinking about working one or the other into my class, since it would let me move away from command-line dependence a bit. (Command lines are great, but throwing them at biologists, along with Python programming, data analysis, and program execution all together, seems a bit cruel. And not that productive.)

It would also be great to have cloud-enabled workflow components. As I embark on more and more sequence analysis, there are only about a dozen "things" we actually do, but mixed and matched. These things could be hardened, parameterized into components, and placed behind an RPC interface that would give us a standard way to execute them. Combined with a suitable data abstraction layer, I could run the components from location A on data in location B in a semi-transparent way, and also record and track their use in a variety of ways. Given a suitably rich set of components and use cases, I could imagine that these components and their interactions could be executed from multiple workflow engines, and with usefully interesting GUIs. I know Domain Specific Languages are already passe, but a DSL might be a good way to find the right division between subcomponents.
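
To give a flavor of what I mean by components behind a standard interface -- and this is purely a hypothetical sketch, no such library exists and all the names are made up -- I'm imagining something like:

# hypothetical sketch: registered, parameterized components plus a dispatcher
# that could hide whether the work runs locally, on an HPC queue, or in the cloud
COMPONENTS = {}

def component(func):
    "Register a hardened, parameterized analysis step."
    COMPONENTS[func.__name__] = func
    return func

@component
def quality_trim(reads, min_length=50):
    # placeholder body; a real component would wrap an actual trimming tool
    return [r for r in reads if len(r) >= min_length]

def run(name, where='local', **params):
    "Record the invocation (the versioning/tracking hook) and dispatch it."
    print('RUN %s on %s' % (name, where))   # stand-in for a real audit trail
    return COMPONENTS[name](**params)

reads = ['ACGTACGTACGT' * 10, 'ACGT']
trimmed = run('quality_trim', reads=reads, min_length=50)
print('%d reads kept' % len(trimmed))

The point being that the component itself knows nothing about where it runs; the dispatcher is the natural place to hang provenance recording, versioning, and scale-out.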

I'd be interested in hearing about such things that may already exist. I'm aware of Galaxy, but I think the components in Galaxy are not quite written at the right level of abstraction for me; Galaxy is also more focused on the GUI than I want. I don't know anything about Taverna, so I'm going to look into that a bit more. And, inevitably, we'll be writing some of our own software in this area, too.

Overall, I'm really interested in workflow approaches that let me transition seamlessly between discovery science and "firing for effect" for hypothesis-driven science.

A few more specific thoughts:

In the area of metagenomics (one of my research focuses at the moment), it would be great to see IMG/M, CAMERA, and MG-RAST move towards a "broken out" workflow that lets semi-sophisticated computational users (hi mom!) run their stuff on the Amazon Cloud and on private HPCs or clouds. While I appreciate hosted services, there are many drawbacks to them, and I'd love to get my hands on the guts of those services. (I'm sure the MG-RAST folk would like me to note that they are moving towards making their pipeline more usable outside of Argonne: so noted.)

In the comments to my post on Four reasons I won't use your data analysis pipeline, Andrew Davison reminds me of VisTrails, which some people at the MS workshop recommended.

I met David De Roure from myExperiment at the MS workshop. To quote, "myExperiment makes it easy to find, use, and share scientific workflows and other Research Objects, and to build communities."

David put me in touch with Carole Goble who is involved with Taverna. Something to look at.

In the cloud computing workshop I organized at the Plant and Animal Genome conference this January, I will get a chance to buttonhole one of the Galaxy Cloud developers. I hope to make the most of this opportunity ;).

It'd be interesting to do some social science research on what difficulties users encounter when they attempt to use workflow engines. A lot of money goes into developing them, apparently, but at least in bioinformatics they are not widely used. Why? This is sort of in line with Greg Wilson's Software Carpentry and the wonderfully named blog It will never work in theory: rather than guessing randomly at what technical directions need to be pursued, why not study it empirically? It is increasingly obvious to me that improving computational science productivity has more to do with lowering learning barriers and changing other societal or cultural issues than with a simple lack of technology, and figuring out how (and if) appropriate technology could be integrated with the right incentives and teaching strategy is pretty important.

--titus

p.s. Special thanks to Kenji Takeda and Tony Hey for inviting me to the workshop, and especially for paying my way. 'twas really interesting!

Syndicated 2011-12-11 15:00:57 from Titus Brown

Trying out 'cram'

I desperately need something to run and test things at the command line, both for course documentation (think "doctest" but with shell prompts) and for script testing (as part of scientific pipelines). At the 2011 testing-in-python BoF, Augie showed us cram, which is the mercurial project's internal test code ripped out for the hoi polloi to use.

Step zero: wonder-twin-powers activate a new virtualenv!

% virtualenv e
% . e/bin/activate

Step one: install!

% pip install cram

... that just works -- always a good sign!

OK, let's test the bejeezus out of 'ls'.

% mkdir cramtest
% cd cramtest

Next, I put

$ ls

into a file. Be careful -- you apparently need exactly two spaces before the $ or it doesn't recognize it as a test.

Now, I run:

% cram ls.t

and I get

.
# Ran 1 tests, 0 skipped, 0 failed.

Awesome! A dot!

The only problem with this is that when I run 'ls' myself, I see:

ls.t    ls.t~

Hmm.

As a test of the cram test software, let's modify the file 'ls.t' to contain a clearly broken test, rather than an empty one:

$ ls
there is nothing here to see

and I get

!
--- /Users/t/dev/cramtest/ls.t
+++ /Users/t/dev/cramtest/ls.t.err
@@ -1,2 +1,1 @@
   $ ls
-  there is nothing here to see

# Ran 1 tests, 0 skipped, 1 failed.

OK, so I can make it break -- excellent! Cram comes advertised with the ability to fix its own tests by replacing broken output with actual output; let's see what happens, shall we?

% cram -i ls.t

!
--- /Users/t/dev/cramtest/ls.t
+++ /Users/t/dev/cramtest/ls.t.err
@@ -1,2 +1,1 @@
   $ ls
-  there is nothing here to see
Accept this change? [yN] y
patching file /Users/t/dev/cramtest/ls.t
Reversed (or previously applied) patch detected!  Assume -R? [n] y
Hunk #1 succeeded at 1 with fuzz 1.

# Ran 1 tests, 0 skipped, 1 failed.
% more ls.t
$ ls
there is nothing here to see
there is nothing here to see

OK, so, first, wtf is the whole reversed patch detected nonsense? Sigh. And second, where's the output from 'ls' going!?

Hmm, maybe cram is setting up a temp directory? That would explain a lot, and would also be a very sensible approach. It's not mentioned explicitly on the front page, but if you read into it a bit, it looks likely. OK.

Let's modify 'ls.t' to create a file:

$ touch testme
$ ls

and run it...

% cram ls.t
!
--- /Users/t/dev/cramtest/ls.t
+++ /Users/t/dev/cramtest/ls.t.err
@@ -1,3 +1,4 @@
   $ touch testme
   $ ls
+  testme


# Ran 1 tests, 0 skipped, 1 failed.

Ah-hah! Now we're getting somewhere! Fix the test by making 'ls.t' read like so:

$ touch testme
$ ls
testme

and run:

% cram ls.t
.
# Ran 1 tests, 0 skipped, 0 failed.

Awesome! Dot-victory ho!

Now let's do something a bit more interesting: check out and run my PyCon 2011 talk code for ngram graphs. Starting with this in 'khmer-ngram.t',

$ git clone git://github.com/ctb/khmer-ngram.git
$ cd khmer-ngram
$ ls
$ python run-doctests.py basic.txt

I run 'cram khmer-ngram.t' and get

!
--- /Users/t/dev/cramtest/khmer-ngram.t
+++ /Users/t/dev/cramtest/khmer-ngram.t.err
@@ -1,4 +1,15 @@
   $ git clone git://github.com/ctb/khmer-ngram.git
+  Initialized empty Git repository in /private/(yada, yada)
   $ cd khmer-ngram
   $ ls
+  basic.html
+  basic.txt
+  data
+  graphsize-book.py
+  hash.py
+  load-book.py
+  run-doctests.py
+  shred-book.py
   $ python run-doctests.py basic.txt
+  ... running doctests on basic.txt
+  *** SUCCESS ***

# Ran 1 tests, 0 skipped, 1 failed.

After getting cram to fix the file (using -i), and re-running cram, it now chokes at exactly one place; betcha you can guess where...:

!
--- /Users/t/dev/cramtest/khmer-ngram.t
+++ /Users/t/dev/cramtest/khmer-ngram.t.err
@@ -1,5 +1,5 @@
   $ git clone git://github.com/ctb/khmer-ngram.git
-  Initialized empty Git repository in /private/(yada, yada)
+  Initialized empty Git repository in /private/(different yada)
   $ cd khmer-ngram
   $ ls
   basic.html

# Ran 1 tests, 0 skipped, 1 failed.

Right. How do you deal with output that does change unpredictably? Easy! Throw in a wildcard regexp like so

Initialized empty Git repository in .* (re)

My whole khmer-ngram.t file now looks like this:

$ git clone git://github.com/ctb/khmer-ngram.git
Initialized empty Git repository in .* (re)
$ cd khmer-ngram
$ ls
basic.html
basic.txt
data
graphsize-book.py
hash.py
load-book.py
run-doctests.py
shred-book.py
$ python run-doctests.py basic.txt
... running doctests on basic.txt
*** SUCCESS ***

And I can run cram on it without a problem:

.
# Ran 1 tests, 0 skipped, 0 failed.

Great!

I love the regexp fix, too; none of this BS that doctest forces upon you.

So, the next question: how do multiple tests work? If you look above, you can see that it's running all the commands as one test. Logically you should be able to just separate out the block of text and make it into multiple tests... let's try adding

I'll add in another test:

  $ ls

to the khmer-ngram.t file; does that work? It looks promising:

!
--- /Users/t/dev/cramtest/khmer-ngram.t
+++ /Users/t/dev/cramtest/khmer-ngram.t.err
@@ -17,3 +17,12 @@
 I'll add in another test:

   $ ls
+  basic.html
+  basic.txt
+  data
+  graphsize-book.py
+  hash.py
+  hash.pyc
+  load-book.py
+  run-doctests.py
+  shred-book.py

# Ran 1 tests, 0 skipped, 1 failed.

and it sees two tests... but, after fixing the expected output using 'cram -i', I only get one test:

.
# Ran 1 tests, 0 skipped, 0 failed.

So it seems like a little internal inconsistency in cram here. Two tests when something's failing, one test when both are running. No big deal in the end.

And... I have to admit, that's about all I need for testing/checking course materials! The cram test format is perfectly compatible with ReStructuredText, so I can go in and write real documents in it, and then test them. Command line testing FTW?

And (I just checked) I can even put in Python commands and run doctest on the same file that cram runs on. Awesome.
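
For instance -- my own untested sketch, not anything from the cram docs -- a file along these lines should satisfy both tools, since cram treats unindented lines as comments and doctest only looks for '>>>' prompts:

The shell half, picked up by cram:

  $ echo hello
  hello

The Python half, picked up by something like 'python -m doctest mixed.t':

>>> 2 + 2
4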

Critique:

The requirement for two spaces exactly before the $ was not obvious to me, nor was the implicit (and silent, even in verbose mode) use of a temp directory. I wiped out my test file a few times by answering "yes" to patching, too. What was up with the 'reversed patch' foo?? And of course it'd be nice if the number of dots reflected something more granular than the number of files run. But heck, it mostly just works! I didn't even look at the source code at all!

Verdict: a tentative 8/10 on the "Can titus use your testing tool?" scale.

I'll try using it in anger on a real project next time I need it, and report back from there.

--titus

p.s. To try out my full cram test from above, grab the file from the khmer-ngram repo at github; see:

https://github.com/ctb/khmer-ngram/blob/master/cram-test.t .

Syndicated 2011-03-14 01:46:48 from Titus Brown

My new data analysis pipeline code

First, I write a recipe file, 'metagenome.recipe', laying out my job description for, say, sequence trimming and assembly with Velvet:

fasta_file soil-data.fa

qc_filter min_length=50 remove_Ns=true

graph_filter min_length=400

velvet_assemble k=33 min_length=1000 scaffolding=True

Then I specify machine parameters, e.g. 'bigmem.conf':

[defaults]
n_threads=8

[graph_filtering]
use_mem=32gb

[velvet]
needs_mem=64gb

And finally, I run the pipeline:

% ak-run metagenome.recipe -c bigmem.conf

If I have cloud access (and who doesn't?) I can tell the pipeline to spin up and down nodes as needed:

% ak-aws-run metagenome.recipe -c bigmem.conf

(Bear in mind most of these tasks are multi-hour, if not multi-day, operations, so I'm not too worried about optimizing machine use and re-use.)

Hadoop jobs could be spawned underneath, depending on how each recipe component was actually implemented.

As for testing reproducibility of pipeline results, which is the short-term motivation here, I can store results for regression testing with later versions:

% ak-run metagenome.recipe -c bigmem.conf --save-endpoint=/some/path

and then compare:

% ak-run --check-endpoint=/some/path
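
None of this exists, of course, but the front end is almost embarrassingly simple to sketch. Here's roughly what the guts of a hypothetical 'ak-run' recipe parser could look like (the .conf side would just be ConfigParser):

# hypothetical guts of the imaginary 'ak-run' front end
import shlex
import sys

def load_recipe(filename):
    """Parse a recipe: each non-blank line is a step name followed by
    optional positional arguments and key=value parameters."""
    steps = []
    for line in open(filename):
        fields = shlex.split(line, comments=True)
        if not fields:
            continue
        name = fields[0]
        args = [f for f in fields[1:] if '=' not in f]
        params = dict(f.split('=', 1) for f in fields[1:] if '=' in f)
        steps.append((name, args, params))
    return steps

if __name__ == '__main__':
    for name, args, params in load_recipe(sys.argv[1]):
        # a real driver would look up per-step resources in the machine
        # config here and dispatch each step locally, to a queue, or to EC2
        print('step %s: args=%s params=%s' % (name, args, params))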

---

Now, does anyone know of a package or packages that actually do this, so I/we don't have to write it??

See texttest and ruffus for some of my inspiration/interest.

--titus

Syndicated 2011-03-11 14:56:29 from Titus Brown

The sky is falling! The sky is falling!

I just parachuted in on (and heli'd out of?) the Beyond the Genome conference in Boston. I gave a very brief workshop on using EC2 for sequence analysis, which seemed well received. (Mind you, virtually everything possible went wrong, from lack of good network access to lack of attendee computers to truncated workshop periods due to conference overrun, but I'm used to the Demo Effect.)

After attending the last bit of the conference, I think that "the cloud" is actually a really good metaphor for what's happening in biology these days. We have an immense science-killing asteroid heading for us (in the form of ridiculously vast amounts of sequence data, a.k.a. "sequencing bonanza" -- our sequencing capacity is doubling every 6-10 months), and we're mostly going about our daily business because we can't see the asteroid -- it's hidden by the clouds! Hence "cloud computing": computing in the absence of clear vision.

But no, seriously. Our current software and algorithms simply won't scale to the data requirements, even on the hardware of the future. A few years ago I thought that we really just needed better data management and organization tools. Then Chris Lee pointed out how much a good algorithm could do -- cnestedlist, in pygr, for doing fast interval queries on extremely large databases. That solved a lot of problems for me. And then next-gen sequencing data started hitting me and my lab, and kept on coming, and coming, and coming... A few fun personal items since my Death of Sequencing Centers post:

  • we managed to assemble some 50 Gb of Illumina GA2 metagenomic data using a novel research approach, and then...

  • ...we received 1.6 Tb of HiSeq data from the Joint Genome Institute as part of the same Great Plains soil sequencing project. It's hard not to think that our collaborators were saying "So, Mr. Smarty Pants -- you can develop new approaches that work for 100 Gb, eh? Try this on for size! BWAHAHAHAHAHAHA!"

  • I've been working for the past two weeks to do a lossy (no, NOT "lousy") assembly of 1.5 billion reads (100 Gb) of mRNAseq Illumina data from lamprey, using a derivative research approach to the one above, and then...

  • ...our collaborators at Mt. Sinai Medical Center told us that we could expect 200 Gb of lamprey mRNA HiSeq data from their next run.

    (See BWAHAHAHAHAHAHAHA, above.)

Honestly, I think most people in genomics, much less biology, don't appreciate the game-changing nature of the sequencing bonanza. In particular, I don't think they realize the rate at which it's scaling. Lincoln Stein had a great slide in his talk at the BTG workshop about the game-changing nature of next-gen sequencing:

http://ivory.idyll.org/permanent/lstein-ngs-capacity.png

The blue line is hardware capacity, the yellow line is "first-gen" sequencing (capillary electrophoresis), and the red line is next-gen sequencing capacity.

It helps if you realize that the Y axis is log scale.

Heh, yeah.

Now, reflect upon this: many sequence analysis algorithms (notably, assembly, but also multiple sequence alignment and really anything that doesn't rely on a fixed-size "reference") are supra-linear in their scaling.

Heh, yeah.

We call this "Big Data", yep yep.

At the cloud computing workshop, I heard someone -- I won't say who, because even though I picked specifically on them, it's a common misconception -- compare computational biology to physics. Physicists and astronomers had to learn how to deal with Big Data, right? So we can, too! Yeah, but colliders and telescopes are big, and expensive. Sequencers? Cheap. Almost every research institution I know has at least one, and often two or three. Every lab I know either has some Gb-sized data set or is planning to generate 1-40 Gb within the next year. Take that graph above, and extrapolate to 2013 and beyond. Yeah, that's right -- all your base belong to us, physical scientists! Current approaches are not going to scale well for big projects, no matter what custom infrastructure you build or rent.

The closest thing to this dilemma that I've read about is in climate modeling (see: Steve Easterbrook's blog, Serendipity). However, I think the difference with biology is that we're generating new scientific data, not running modeling programs that generate our data. Having been in both situations, I can tell you that it's very different when your data is not something you can decline to measure, or something you can summarize and digest more intelligently while generating it.

I've also heard people claim that this isn't a big problem compared to, say, the problems that we face with Big Data on the Internet. I think the claim at the time was that "in biology, your data is more structured, and so you haff vays of dealing with it". Poppycock! The unknown unknowns dominate, everyone: we often don't know what we're looking for in large-scale biological data. When we do know, it's a lot easier to devise data analysis strategies; but when we don't really know, people tend to run a lot of different analyses, looking looking looking. So in many ways we end up with an added polynomial-time exploratory computation scaling (trawling through N databases with M half-arsed algorithms) on top of all the other "stuff" (Big Data, poorly scaling algorithms).

OK, OK, so the sky is falling. What do we do about it?

I don't see much short-term hope in cross-training more people, despite my efforts in that direction (see: next-gen course, and the BEACON course). Training is a medium-term effort: necessary but not all that helpful in the short term.

It's not clear that better computational science is a solution to the sequencing bonanza. Yes, most bioinformatics software has problems, and I'm sure most analyses are wrong in many ways -- including ours, before you ask. It's a general problem in scientific computation, and it's aggravated by a lack of training, and we're working on that, too, with things like Software Carpentry.

I do see two lights at the end of the tunnel, both spurred by our own research and that of Michael Schatz (and really the whole Salzberg/Pop gang) as well as Narayan Desai's talk at the BTG workshop.

First, we need to change the way analysis scales -- see esp. Michael Schatz's work on assembly in the cloud, and (soon, hopefully) our own work on scaling metagenomic and mRNAseq assembly. Michael's code isn't available (tsk tsk) and ours is available but isn't documented, published, or easy to use yet, but we can now do "exact" assemblies of 100 Gb of metagenomic, and we're moving towards nearly-exact assemblies of arbitrarily large RNAseq and metagenomic data sets. (Yes, "arbitrary". Take THAT, JGI.)

We will have to do this kind of algorithmic scaling on a case-by-case basis, however. I'm focused on certain kinds of sequence analysis, personally, but there's a huge world of problems out there that will need constant attention to scale them in the face of the new data. And right now, I don't see too many CSE people focused on this, because they don't see the need to scale to Tb.

Second, Big Data and cloud computing are, combined, going to dynamite the traditional HPC model and make it clear that our only hope is to work smarter and develop better algorithms, in combination with scaling compute power. How so?

As Narayan has eloquently argued many times, it no longer makes sense for most institutions to run their own HPC, if you take into account the true costs of power, AC, and hardware. The only reason it looks like HPCs work well is because of the way institutions play games with funny money (a.k.a. "overhead charges"), channeling it to HPC behind the scenes - often with much politicking involved. If, as a scientist, your compute is "free" or even heavily subsidized, you tend not to think much about it. But now that we have to scale those clusters 10s or 100s or 1000s of X, to deal with data 100s or 1e6s of times as big, institutions will no longer be able to afford to build their own clusters with funny money. And they'll have to charge scientists for the true computational cost of their work -- or scientists will have to use the cloud. Either way, people will be exposed to how much it really costs to run, say, BLAST against 100,000,000 short reads. And soon thereafter they'll stop doing such foolish things.

In turn, this will lead to a significant effort to make better use of hardware, either by building better algorithms or asking questions more cleverly. (Goodbye, BLAST!) It hurts me to say that, because I'm not algorithmically focused by nature; but if you want to know the answer to a biological question, and you have the data, but existing approaches can't handle it within your budget... what else are you going to do but try to ask, or answer, the question more cleverly? Narayan said something like "we'll have to start asking if $150/BLAST is a good deal or not" which, properly interpreted, makes the point well: it's a great deal if you have $1000 and only one BLAST to do, but what if you have 500 BLASTs? And didn't budget for it?

Fast, cheap, good. Choose two.

Better algorithms and more of a focus on their importance (due to the exposure of true costs) are two necessary components to solving this problem, and there are increasingly many people working on them. So I think there are these two lights at the end of the tunnel for the Big Data-in-biology challenges. And probably there are some others that I'm missing. Although, of course, these two lights at the end of the tunnel may be train headlights, but we can hope, right?

--titus

p.s. Chris Dagdigian from BioTeam gave an absolutely awesome talk on many of these issues, too. Although he seems more optimistic than I am, possibly because he's paid by the hour :).

Syndicated 2010-10-14 21:25:51 from Titus Brown

A memory efficient way to remove low-abundance k-mers from large (metagenomic?) DNA data sets

I've spent the last few weeks working on a simple solution to a challenging problem in DNA sequence assembly, and I think we've got a nice simple theoretical solution with an actual implementation. I'd be interested in comments!

Introduction

Briefly, the algorithmic challenge is this:

We have a bunch of DNA sequences in (let's say) FASTA format,

>850:2:1:1145:4509/1
CCGAGTCGTTTCGGAGGGACCCCGCCATCATACTCGGGGAATTCATCTGAAAGCATGATCATAGAGTCACCGAGCA
>850:2:1:1145:4509/2
AGCCAAGAGCACCCCAGCTATTCCTCCCGGACTTCATAACGTAACGGCCTACCTCGCCATTAAGACTGCGGCGGAG
>850:2:1:1145:14575/1
GACGCAAAAGTAATCGTTTTTTGGGAACATGTTTTCATCCTGATCATGTTCCTGCCGATTCTGATCTCGCGACTGG
>850:2:1:1145:14575/2
TAACGGGCTGAATGTTCAGGACCACGTTTACTACCGTCGGGTTGCCATACTTCAACGCCAGCGTGAAAAAGAACGT
>850:2:1:1145:2219/1
GAAGACAGAGTGGTCGAAAGCTATCAGACGCGATGCCTAACGGCATTTTGTAGGGCGTCTGCGTCAGACGCCAACC
>850:2:1:1145:2219/2
GAAGGCGGGTAATGTCCGACAAACATTTGACGTCAAAGCCGGCTTTTTTGTAGTGGGTTTGACTCTTTCGCTTCCG
>850:2:1:1145:5660/1
GATGGCGTCGTCCGGGTGCCCTGGTGGAAGTTGCGGGGATGCGGATTCATCCGGGACGCGCAGACGCAGGCGGTGG

and we want to pre-filter these sequences so that only sequences containing high-abundance DNA words of length k ("k-mers") remain. For example, given a set of DNA sequences, we might want to remove any sequence that does not contain a k-mer present at least 5 times in the entire data set.

This is very straightforward to do for small numbers of sequences, or for small k. Unfortunately, we are increasingly confronted by data sets containing 250 million sequences (or more), and we would like to be able to do this for large k (k > 20, at least).

You can break the problem down into two basic steps: first, counting k-mers; and second, filtering sequences based on those k-mer counts. It's not immediately obvious how you would parallelize this task: the counting should be very quick (basically, it's I/O bound) while the filtering is going to rely on wide-reaching lookups that aren't localized to any subset of k-mer space.
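
For reference, here's the obvious exact two-pass version in Python, with a hypothetical 'reads.fa' input and reverse complements ignored for brevity. It's exactly this dictionary -- one entry per unique k-mer -- that stops fitting in memory at hundreds of millions of reads and large k:

# naive exact version: count every k-mer, then filter reads on the counts.
# NOTE: reverse complements are ignored here for simplicity.
K = 32
MIN_ABUND = 5

def read_fasta(filename):
    name, seq = None, []
    for line in open(filename):
        if line.startswith('>'):
            if name:
                yield name, ''.join(seq)
            name, seq = line[1:].strip(), []
        else:
            seq.append(line.strip())
    if name:
        yield name, ''.join(seq)

counts = {}
for name, seq in read_fasta('reads.fa'):             # pass 1: count k-mers
    for i in range(len(seq) - K + 1):
        kmer = seq[i:i + K]
        counts[kmer] = counts.get(kmer, 0) + 1

with open('reads.keep.fa', 'w') as out:              # pass 2: filter reads
    for name, seq in read_fasta('reads.fa'):
        kmers = (seq[i:i + K] for i in range(len(seq) - K + 1))
        if any(counts[k] >= MIN_ABUND for k in kmers):
            out.write('>%s\n%s\n' % (name, seq))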

tl; dr? we've developed a way to do this for arbitrary k, in linear time and constant memory, efficiently utilizing as many computers as you have available. It's open source and works today, but, uhh, could use some documentation...

Digression: the bioinformatics motivation

(You can skip this if you're not interested in the biological motivation ;)

What we really want to do with this kind of data is assemble it, using a De Bruijn graph approach a la Velvet, ABySS, or SOAPdenovo. However, De Bruijn graphs all rely on building a graph of overlapping k-mers in memory, which means that their memory usage scales as a function of the number of unique k-mers. This is good in general, but Bad in certain circumstances -- in particular, whenever the data set you're trying to assemble has a lot of genomic novelty. (See this fantastic review and my assembly lecture for some background here.)

One particular circumstance in which De Bruijn graph-based assemblers flail is in metagenomics, the isolation and sequencing of genetic material from "the wild", e.g. soil or the human gut. This is largely because the diversity of bacteria present in soil is so huge (although the relatively high error rate of next-gen platforms doesn't help).

Prefiltering can help reduce this memory usage by removing erroneous sequences as well as not-so-useful sequences. First, any sequence arising as the result of a sequencing error is going to be extremely rare, and abundance filtering will remove those. Second, genuinely "rare" (shallowly-sequenced) sequences will generally not contribute much to the assembly, and so removing them is a good heuristic for reducing memory usage.

We are currently playing with dozens (probably hundreds, soon) of gigabytes of metagenomic data, and we really need to do this prefiltering in order to have a chance at getting a useful assembly out of it.

It's worth noting that this is in no way an original thought: in particular, the Beijing Genome Institute (BGI) did this kind of prefiltering in their landmark Human Microbiome paper (ref). We want to do it, too, and the BGI wasn't obliging enough to give us source code (booooooo hisssssssssssssss).

Existing approaches

Existing approaches are inadequate to our needs, as far as we can tell from a literature survey and private conversations. Everyone seems to rely on big-RAM machines, which are nice if you have them, but shouldn't be necessary.

There are two basic approaches.

First, you can build a large table in memory, and then map k-mers into it. This involves writing a simple hash function that converts DNA words into numbers. However, this approach scales poorly with k: for example, there are 4**k unique DNA sequences of length k (or roughly (4**k) / 2 + (4**(k/2))/2, considering reverse complements). So the table for k=17 needs 4**17 entries -- that's 17 gb at 1 byte per counter, which stretches the memory of cheaply available commodity hardware. And we want to use bigger k than 17 -- most assemblers will be more effective for longer k, because it's more specific. (We've been using k values between 30 and 70 for our assemblies, and we'd like to filter on the same k.)

Second, you can observe that k-mer space (for sufficiently large k) is likely to be quite sparsely occupied -- after all, there's only so many actual 30-mers present in a 100gb data set! So, you can do something clever like use a tree representation of k-mers and then attach counters to the end nodes of the tree (see, for example, tallymer). The problem here is that you need to use pointers to connect nodes in the tree, which means that while the tree size is going to scale with the amount of novel k-mers -- ok! -- it's going to have a big constant in front of it -- bad! In our experience this has been prohibitive for real metagenomic data filtering.

These seem to be the two dominant approaches, although if you don't need to actually count the k-mers but only assess presence or absence, you can use something like a Bloom filter -- for example, see Classification of DNA sequences using a Bloom filter, which uses Bloom filters to look for novel sequences (the exact opposite of what we want to do here!). References to other approaches welcome...

Note that you really, really, really want to avoid disk access by keeping everything in memory. These are ginormous data sets and we would like to be able to quickly explore different parameters and methods of sequence selection. So we want to come up with a good counting scheme for k-mers that involves relatively small amounts of memory and as little disk access as possible.

I think this is a really fun kind of problem, actually. The more clever you are, the more likely you are to come up with a completely inappropriate data structure, given the amount of data and the basic algorithmic requirements. It demands KISS! Read on...

An approximate approach to counting

So, we've come up with a solution that scales with the amount of genomic novelty, and efficiently uses your available memory. It can also make effective use of multiple computers (although not multiple processors). What is this magic approach?

Hash tables. Yep. Map k-mers into a fixed-size table (presumably one about as big as your available memory), and have the table entries be counters for k-mer abundance.

This is an obvious solution, except for one problem: collisions. The big problem with hash tables is that you're going to have collisions, wherein multiple k-mers are mapped into a single counting bin. Oh noes! The traditional way to deal with this is to keep track of each k-mer individually -- e.g. when there's a collision, use some sort of data structure (like a linked list) to track the actual k-mers that mapped to a particular bin. But now you're stuck with using gobs of memory to keep these structures around, because each collision requires a new pointer of some sort. It may be possible to get around this efficiently, but I'm not smart enough to figure out how.

Instead of becoming smarter, I reconfigured my brain to look at the problem differently. (Think Different?)

The big realization here is that collisions may not matter all that much. Consider the situation where you're filtering on a maximum abundance of 5 -- that is, you want at least one k-mer in each sequence to be present five or more times across the data set. Well, if you look at the hash bin for a specific k-mer and see the number 4, you immediately know that whether or not there are any collisions, that particular k-mer isn't present five or more times, and can be discarded! That is, the count for a particular hash bin is the sum of the (non-negative) individual counts for k-mers mapping to that bin, and hence that sum is an upper bound on any individual k-mer's abundance.

http://ivory.idyll.org/permanent/kmer-hashtable.png
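
Here's the core of the idea as a minimal pure-Python sketch (the real implementation is C++ with one-byte counters; 'bin_for' and the table size here are just illustrative):

# approximate k-mer counting in a fixed-size table: collisions can only
# inflate a bin's count, so a bin below the cutoff is a guaranteed discard.
HASHTABLE_SIZE = 50000000      # in practice, pick a prime near your RAM limit
MIN_ABUND = 5
K = 32

table = [0] * HASHTABLE_SIZE

def bin_for(kmer):
    return hash(kmer) % HASHTABLE_SIZE

def count_kmers(seq):
    for i in range(len(seq) - K + 1):
        b = bin_for(seq[i:i + K])
        if table[b] < 255:                   # saturating, byte-sized counter
            table[b] += 1

def maybe_keep(seq):
    "True if some k-mer *might* occur >= MIN_ABUND times; false positives only."
    return any(table[bin_for(seq[i:i + K])] >= MIN_ABUND
               for i in range(len(seq) - K + 1))

The only state is the fixed-size table, so memory use is set up front and never grows with the data -- the cost is purely in false positives.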

The tradeoff is false positives: as k-mer space fills up, the hash table is going to be hit by more and more collisions. In turn, more of the k-mers are going to be falsely reported as being over the minimum abundance, and more of the sequences will be kept. You can deal with this partly by doing iterative filtering with different prime hash table sizes, but that will be of limited utility if you have a really saturated hash table.

Note that the counting and filtering is still O(N), while the false positives grow as a function of k-mer space occupancy -- which is to say that the more diversity you have in your data, the more trouble you're in. That's going to be a problem no matter the approach, however.

You can see a simple example of approximate and iterative filtering here, for k=15 (a k-mer space of approximately 1 billion) and hash table sizes ranging from 50m to 100m. The curves all approach the correct final number of reads (which we can calculate exactly, for this data set) but some take longer than others. (The code for this is here.)

http://ivory.idyll.org/permanent/kmer-filtering-iterative.png

Making use of multiple computers

While I was working out the details of the (really simple) approximate counting approach, I was bugged by my inability to make effective use of all the juicy computers to which I have access. I don't have many big computers, but I do have lots of medium sized computers with memory in the ~10-20 gb range. How can I use them?

This is not a pleasantly parallel problem, for at least two reasons. First, it's I/O bound -- reading the DNA sequences in takes more time than breaking them down into k-mers and counting them. And since it's really memory bound -- you want to use the biggest hash table possible to minimize collisions -- it doesn't seem like using multiple processors on a single machine will help. Second, the hash table is going to be too big to effectively share between computers: 10-20 gb of data per computer is too much to send over the network. So what do we do?

I was busy explaining to a colleague that this was impossible -- always a useful way for me to solve problems ;) -- when it hit me that you could use multiple chassis (RAM + CPU + disk) to decrease the false positive rate with only a small amount of communication overhead. Basically, my approach is to partition k-mer space into Z subsets (one for each chassis) and have each computer count only the k-mers that fall into their partition. Then, after the counting stage, each chassis records a max k-mer abundance per partition per sequence, and communicates that to a central node. This central node in turn calculates the max k-mer abundance across all partitions.

The partitioning trick is a more general form of the specific 'prefix' approach -- that is, separately count and get max abundances/sequence for all k-mers starting with 'A', then 'C', then 'G', and then 'T'. For each sequence you will then have four values (the max abundance/sequence for k-mers start with 'A', 'C', 'G', and 'T'), which can be cheaply stored on disk or in memory. Now you can do a single-pass integration and figure out what sequences to keep.

This approach effectively multiplies your working memory by a factor of Z, decreasing your false positive rate equivalently - no mean feat!

http://ivory.idyll.org/permanent/kmer-hashtable-par.png

http://ivory.idyll.org/permanent/kmer-filter-process-2.png
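
In sketch form, the per-chassis logic is tiny (hypothetical helper names; Z is the number of chassis, and 'get_count' stands in for the per-partition hash table above):

Z = 4                    # number of chassis, i.e. partitions of k-mer space
MY_PARTITION = 0         # each chassis gets a different value in 0..Z-1
K = 32

def my_kmers(seq):
    "Yield only the k-mers that belong to this chassis's partition."
    for i in range(len(seq) - K + 1):
        kmer = seq[i:i + K]
        if hash(kmer) % Z == MY_PARTITION:
            yield kmer

def max_abundance(seq, get_count):
    "Per-sequence max count over this partition's k-mers (one byte to report)."
    return max([get_count(k) for k in my_kmers(seq)] or [0])

def keep(per_partition_maxes, min_abund=5):
    "Central node: combine the Z reports and decide whether to keep the read."
    return max(per_partition_maxes) >= min_abund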

The communication load is significant but not prohibitive: replicate a read-only sequence data set across Z computers, and then communicate max values (1 byte for each sequence) back -- 50-500 mb of data per filtering round. The result of each filtering round can be returned to each node as a readmask against the already-replicated initial sequence set, with one bit per sequence for ignore or process. You can even do it on a single computer, with a local hard drive, in multiple iterations.

You can see a simple in-memory implementation of this approach here, and some tests here. I've implemented this using readmask tables and min/max tables (serializable data structures) more generally, too; see "the actual code", below.

Similar approaches

By allowing for false positives, I've effectively turned the hash table into a probabilistic data structure. The closest analog I've seen is the Bloom filter, which records presence/absence information using multiple hash functions for arbitrary k. The hash approach outlined above devolves into a maximally efficient Bloom filter for fixed k when only presence/absence information is recorded.
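
For illustration, here is what that degenerate case looks like: drop the counts, keep one presence/absence flag per slot, and the table behaves like a Bloom filter with a single hash function (a sketch of mine, not khmer code).

    HASH_SIZE = 49999999
    seen = bytearray(HASH_SIZE)         # 0/1 flags instead of counts (~50 mb)

    def mark(kmer):
        seen[hash(kmer) % HASH_SIZE] = 1

    def maybe_present(kmer):
        # True can be a false positive (a collision); False is always correct.
        return seen[hash(kmer) % HASH_SIZE] == 1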

The actual code

Theory and practice are the same, in theory. In practice, of course, they differ. A whole host of minor interface and implementation design decisions needed to be made. The result can be seen at the 'khmer' project, here: http://github.com/ctb/khmer. It's slim on documentation, but there are some example scripts and a reasonable number of tests. It requires nothing but Python 2.6 and a compiler, plus nose if you want to run the tests.

The implementation is in C++ with a Python wrapper, much like the paircomp and motility software packages.

It will filter 1m 70-mer sequences in about 45 seconds, and 80 million sequences in less than an hour, on a 3 GHz Xeon with 16 gb of RAM.

Right now it's limited to k <= 32, because we encode each DNA k-mer in a 64-bit 'long long'.
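
The limit comes from the usual 2-bits-per-base packing; here is a sketch of the idea (the particular bit values assigned to each base are illustrative and may not match khmer's internals):

    # 2 bits per base: 32 bases fill a 64-bit word exactly.
    ENCODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}   # illustrative assignment

    def encode_kmer(kmer):
        assert len(kmer) <= 32, "only 32 bases fit in 64 bits at 2 bits/base"
        value = 0
        for base in kmer:
            value = (value << 2) | ENCODE[base]
        return value

    # e.g. encode_kmer('ACGT') == 0b00011011 == 27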

You can see an exact filtering script here: http://github.com/ctb/khmer/blob/master/scripts/filter-exact.py . By varying the hash table size (second arg to new_hashtable) you can turn it into an inexact filtering script quite easily.

Drop me a note if you want help using the code, or a specific example. We're planning to write documentation, doctests, examples, robust command line scripts, etc. prior to publication, but for now we're primarily trying to use it.

Other uses

It has not escaped our notice that you can use this approach for a bunch of other k-mer based problems, such as repeat discovery and calculating abundance distributions... but right now we're focused on actually using it for filtering metagenomic data sets prior to assembly.

Acks

I talked a fair bit with Prof. Rich Enbody about this approach, and he did a wonderful job of double-checking my intuition. Jason Pell and Qingpeng Zhang are co-authors on this project; in particular, Jason helped write the C++ code, and Qingpeng has been working with k-mers in general and tallymer specifically on some other projects, and provided a lot of background help.

Conclusions

We've taken a previously hard problem and made it practically solvable: we can filter ~88m sequences in a few hours on a single-processor computer with only 16gb of RAM. This seems useful.

Our main challenge now is to come up with a better hashing function. Our current hash function is not uniform, even for a uniform distribution of k-mers, because of the way it handles reverse complements.

The approach scales reasonably nicely. Doubling the amount of data doubles the compute time. However, if you also double the novelty, you'll need twice as many partitions to keep the same false positive rate -- and twice as many passes over twice as much data means you quadruple the compute time. So it's O(N^2) for the worst case (unending novelty) but should be something better for real-world cases. That's what we'll be looking at over the next few months.

I haven't done enough background reading to figure out if our approach is particularly novel, although in the space of bioinformatics it seems to be reasonably so. That's less important than actually solving our problem, but it would be nice to punch the "publication" ticket if possible. We're thinking of writing it up and sending it to BMC Bioinformatics, although suggestions are welcome.

It would be particularly ironic if the first publication from my lab was this computer science-y, given that I have no degrees in CS and am in the CS department by kind of a fluke of the hiring process ;).

Syndicated 2010-07-07 15:09:20 from Titus Brown

Teaching scientists how to use computers - hub & spokes

After my recent next-gen sequencing course, which was supposed to tie into the whole software carpentry (SWC) effort but didn't really succeed in doing so the first time through, I started thinking about the Right Way to tie in the SWC material. In particular, how do you both motivate scientists to look at the SWC material, and (re)direct people to the appropriate places?

It's not clear that a Plan is in place. Greg Wilson seems to assume that scientists will find at least some of the material immediately obviously usable, but I think he's targeting a more sophisticated population of users -- physicists and the like. My experience with bioinformaticians, however, is that they either come from straight biology backgrounds (with little or no computational background and rather limited on-the-job training), straight computation backgrounds (with very little biology), or physics (gonzo programming skills, but no biology). The latter fit neatly into the SWC fold, but they (we ;) are rare in biology. I think computer scientists and biologists are going to need guidance to dive into SWC early enough for it to be the most rewarding.

So, what's a good model for SWC to guide scientists from multiple disciplines into the appropriate material? It's obviously not going to be possible to have Greg et al. tailor the SWC material to individual subgroups -- he doesn't know much (any ;) biology, for example. I don't have the time, patience, or skillset to integrate my next-gen notes into his SWC material, either. So, instead, I propose the hub & spokes model!

http://ivory.idyll.org/permanent/hub-spokes.png

Here, the "hub" is the SWC material, and the spokes are all of the individual disciplines.

Basically, the idea is that individual sites (like my own ANGUS site on next-gen sequencing, http://ged.msu.edu/angus/) will develop their own field-specific content, and then link from that content into the SWC notes. This way the experts with feet in both fields can link appropriately, and Greg only has to worry about making the central content general -- which he's already doing quite well, I think. Yes, it's more work than asking Greg to do it, but frankly I'm going to be happy with a kick-ass central SWC site to which I can link -- right now it's dismayingly challenging to teach students why this stuff matters and how to learn it.

From the psychosocial perspective, it's a great fit. Students can get hands on tutorials on how to do X, Y, and Z in their own field -- and then connect into the SWC material to learn the background, or additional computational techniques in support of it. Motivation first!

What do we need SWC to do to support this? Not much -- basically, the central SWC notes need to be stable enough (with permalinks) that I can link into them from my own site(s) and not have to worry about the links becoming broken or (worse) silently migrating in topic. There are other solutions (wholesale incorporation of SWC into my own notes, for example) but I think the permalink idea is the most straightforward. Oh, and we should have a Greg-gets-hit-by-a-bus plan, too; at some point he's going to move on from SWC (perhaps when his lovely wife decides she's had enough and he needs to stop obsessing over it, or perhaps under more dire circumstances ;() and it would be good to know who holds the domain and site keys.

Thoughts? Comments?

--titus

Syndicated 2010-07-06 03:12:24 from Titus Brown

Which functional programming language(s) should we teach?

Laurie Dillon just posted the SIGPLAN education board article on Why Undergraduates Should Learn the Principles of Programming Languages to our faculty mailing list at the MSU Computer Science department. One question that came up in the ensuing conversation was: what functional programming language(s) would/should we teach?

I mentioned OCaml, Haskell, and Erlang as reasonably pure but still pragmatic FP languages. Anything else that's both "truly" functional and used somewhat broadly in the real world?

thanks!

--titus

Syndicated 2010-06-24 18:31:48 from Titus Brown

Teaching next-gen sequencing data analysis to biologists

Our sequencing analysis course ended last Friday, with an overwhelmingly positive response from the students. The few negative comments that I got were largely about organizational issues, and could be reshaped as suggestions for next time rather than as condemnations of this year's course.

http://ivory.idyll.org/permanent/ngs-2010-group.png

The 23 students -- most with no prior command-line experience -- spent two weeks experiencing at first hand the challenges of dealing with dozens of gigabytes of sequencing data. Each of the students went through genome-scale mapping, genome assembly, mRNAseq analysis on an "emerging model organism" (a.k.a "one with a crappy genome", lamprey), resequencing analysis on E. coli, and ChIP-seq analysis on Myxococcus xanthus. By the beginning of the second week, many students were working with their own data -- a real victory. Python programming competency may take a bit longer, but many of them seem motivated.

If you had told me three weeks ago that we could pull this off, I would have told you that you were crazy. This does beg the question of what I was thinking when I proposed the course -- but don't dwell on that, please...

The locale was great, as you can see:

http://ivory.idyll.org/permanent/ngs-2010-beach.png

One of the most important lessons of the course for me is that cloud computing works well to backstop this kind of course. I was very worried about suitability, reliability, and ease of use, but AWS did a great job, providing an easy-to-use Web interface and a good range of machine images. I have little doubt that this course would have been nearly impossible (and either completely ineffective or much more expensive) without it.

In the end, we spent more on beer than on computational power. That says something important to me :)

The course notes are available under a CC license although they need to be reworked to use publicly available data sets before they become truly useful. At that point I expect them to become awesomely useful, though.

From the scientific perspective, the students derived a number of significant benefits from the course. One that I had not really expected was that some students had no idea what went into computational "sausage", and were kind of shocked to see what kinds of assumptions we comp bio people made on their behalf. This was especially true in the case of students from companies, who have pipelines that are run on their data. One student lamented that "we used to look at the raw traces... now all we get are spreadsheet summaries!" Another student came to me in a panic because they didn't realize that there was no one true answer -- that that was in fact part of the "fun" of all biology, not just experimental biology. These reactions alone made teaching the course worthwhile.

Of course, the main point is that many of the students seem to be capable of at least starting their own analyses now. I was surprised at the practical power of our cut-and-paste approach -- for example, if you look at the Short-read assembly with ABySS tutorial, it turns out to be relatively straightforward to adapt this to doing assemblies of your own genomic or transcriptomic data. I based our approach on Greg Wilson's post on the failure of inquiry-based teaching and so far I like it.

I am particularly amused that we have now documented, in replicable detail, the Kroos Lab MrpC ChIP analysis. We also have the best documentation for Jeff Barrick's breseq software, I think; this is what is used to analyze the Long Term Evolution Experiment lines -- and I can't wait for the anti-evolutionists to pounce on that... "Titus Brown -- making evolution experiments accessible to creationists." Yay?

There were a number of problems and mistakes that we had to steamroller through. In particular, more background and more advanced tutorials would have been great, but we just didn't have time to write them. Some 454, Helicos, and SOLiD data sets (and next year, PacBio?) would be a good addition. We had a general lack of multiplexing data, which is becoming a Big Thing now that sequencing is so ridiculously deep. I would also like to introduce additional real data analyses next year, reprising things like the Cufflinks analysis and whole-vertebrate-genome ChIP-seq/mRNAseq a la the Wold Lab. I'm weighing adding metagenomics data analysis in for a day, although it's a pretty separate field of inquiry (and frankly much harder in terms of "unknown unknowns"). We also desperately need some plant genomics expertise, because frankly I know nothing about plant genomes; my last-minute plant genomics TA fell through due to lack of planning on my part. (Conveniently, plant genomics is something MSU is particularly good at, so I'm sure I can find someone next year.)

Oops, did I say next year? Well, yes. If I can find funding for my princely salary, then I will almost certainly run the course again next year. I can cover TAs and my own room/board and speakers with workshop fees, but if I'm going to keep room+board+fees under $1000/student -- a practical necessity for most -- there's no way I can pay myself, too. And while this year I relied on my lovely, patient, and frankly long-suffering wife to hold down the home fort while I was away for two weeks, I simply can't put her through that again, so I will need to pay for a nanny next year. So doing it for free is not an option.

In other words, if you are a sequencing company, or an NIH/NSF/USDA program director, interested in keeping this going, please get in touch. I plan to apply for this Initiative to Maximize Research Education in Genomics in September, but I am not confident of getting that on the first try, and in any case I will need letters of support from interested folks. So drop me a note at ctb@msu.edu.

Course development this year was sponsored by the MSU Gene Expression in Disease and Development, to whom I am truly grateful. The course would simply not have been possible without their support.

My overall conclusion is that it is possible to teach bench biologists with no prior computational experience to achieve at least minimal competency in real-world data analysis of next-generation sequencing data. I can't conclusively demonstrate this without doing a better job of course evaluation, and of course only time will tell if it sticks for any of the students, but right now I'm feeling pretty good about the course overall. Not to mention massively relieved.

--titus

p.s. Update from one student -- "It's not even 12 o'clock Monday morning and I'm already getting people asking me how to run assemblies and analyze data." Heh.

Syndicated 2010-06-14 15:38:31 from Titus Brown

Running a next-gen sequence analysis course using Amazon Web Services

So, I've been teaching a course on next-generation sequence analysis for the last week, and one of the issues I had to deal with before I proposed the course was how to deal with the volume of data and the required computation.

You see, next-generation sequence analysis involves analyzing not just entire genomes (which are, after all, only 3gb or so in size) but data sets that are 100x or 1000x as big! We want to not just map these data sets (which is CPU-intensive), but also perform memory-intensive steps like assembly. If you have a class with 20+ students in it, you need to worry about a lot of things:

  • computational power: how do you provide 24 "good" workstations
  • memory
  • disk space
  • bandwidth
  • "take home" ability

One strategy would be to simply provide some Linux or Mac workstations, with cut-down data sets. But then you wouldn't be teaching reality -- you'd be teaching a cut-down version of reality. This would make the course particularly irrelevant given that one of the extra-fun things about next-gen sequence analysis is how hard it is to deal with the volume of data. You also have to worry that the course would be made even more irrelevant because the students would leave the course and be unable to use the information without finding infrastructure and installing a bunch of software and then administering the machine.

While I enjoy setting up computers, installing software, and managing users (I'm clearly masochistic), it's entirely beside the point for bioinformaticians and biologists - they just want to analyze data!

The solution I came up with was to use Amazon Web Services and rent some EC2 machines. There's a large variety of hardware configurations available (see instance types) and they're not that expensive per hour (see pricing).

This has worked out really, really well.

It's hard to enumerate the benefits, because there have been so many of them ;). A few of the obvious ones --

We've been able to write tutorials (temporary home here: http://ged.msu.edu/angus/) that make use of specific images and should be as future-proof as they can be. We've given students cut and paste command lines that Just Work, and that they can tweak and modify as they want. If it borks, they can always just throw it away and start from a clean install.

It's dirt cheap. We spent less than $50 the first week, for ~30 people using an average of 8 hours of CPU time. The second week will increase to an average of 8 hours of CPU time a day, and for larger instances -- so probably about $300 total, or maybe even $500 -- but that's ridiculously cheap, frankly, when you consider that there are no hardware issues or OS re-install problems to deal with!

Students can choose whatever machine specs they need in order to do their analysis. More memory? Easy. Faster CPU needed? No problem.

All of the data analysis takes place off-site. As long as we can provide the data sets somewhere else (I've been using S3, of course) the students don't need to transfer multi-gigabyte files around.

The students can go home, rent EC2 machines, and do their own analyses -- without their labs buying any required infrastructure.

Home institution computer admins can use the EC2 tutorials as documentation to figure out what needs to be installed (and potentially, maintained) in order for their researchers to do next-gen sequence analysis.

The documentation should even serve as a general set of tutorials, once I go through and remove the dependence on private data sets! There won't be any need for students to do difficult or tricky configurations on their home machines in order to make use of the tutorial info.

So, truly awesome. I'm going to be using it for all my courses from now on, I think.

There have been only two minor hitches.

First, I'm using Consolidated Billing to pay for all of the students' computer use during the class, and Amazon has some rules in place to prevent abuse of this. They're limiting me to 20 consolidated billing accounts per AWS account, which means that I've needed to get a second AWS account in order to add all 30 students, TAs, and visiting instructors. I wouldn't even mention it as a serious issue but for the fact that they don't document it anywhere, so I ran into this on the first day of class and then had to wait for them to get back to me to explain what was going on and how to work around it. Grr.

Second, we had some trouble starting up enough Large instances simultaneously on the day we were doing assembly. Not sure what that was about.

Anyway, so I give a strong +1 on Amazon EC2 for large-ish style data analysis. Good stuff.

cheers, --titus

Syndicated 2010-06-08 14:52:29 from Titus Brown
