Older blog entries for titus (starting at number 474)

PyCon 2013 talks I really don't want to see

There's been a lot of discussion about PyCon talks that we do want to see. Here's a brief list of those I don't want to see, for those of you considering a submission -- in no particular order.

  1. 1001 Mocking Frameworks - a comparison and overview
  2. Long Live Twill
  3. Zope vs. Django - why Zope was right all along
  4. Why We Need More Men in Python - a Diversity discussion
  5. Centralized Version Control - it's the future
  6. Guido van Rossum on Python 4k - it's the future
  7. Running Python under Windows on my Superboard II
  8. Lists - way more useful than you ever thought
  9. What the Python Community Can Learn from Java
  10. Solving Easy Problems - my very own customized approaches

Any other ideas? Add 'em or send me links :)

--titus

Syndicated 2012-08-23 22:00:00 from Living in an Ivory Basement

Welcome to my new blog!

I've just moved my blog over to Pelican, a static blog generator that takes in reStructuredText and spits out, well, this! I'm now using Disqus for commenting, too.

The main motivations for the move (apart from slightly better theming) were to escape dynamic-blog-land in favor of static-blog-land, while enabling a better commenting setup. Pelican+disqus looked like a great solution; we'll see how it goes!

One note -- rather than hack and slash my way through disqus's commenting system upload fail, I just attached all of the comments as "legacy comments" on old blog entries. Yeah, it sucks, sorry.

--titus

Syndicated 2012-06-24 22:00:00 from Living in an Ivory Basement

Some early experience in teaching using ipython notebook

As part of the 2012 Analyzing Next-Generation Sequencing Data course, I've been trying out ipython notebook for the tutorials.

In previous years, our tutorials all looked like this: Short read assembly with Velvet -- basically, reStructuredText files integrated with Sphinx. This had a lot of advantages, including Googleability and simplicity; but it also meant that students spent a lot of time copying and pasting commands.

This year, I tried mixing things up with some ipython notebook, using pre-written notebooks -- see for example a static view of the BLAST notebook. The notebooks are sourced at https://github.com/ngs-docs/ngs-notebooks, and can be automatically updated and placed on an EC2 instance for the students to run. The idea is that the students can simply shift-ENTER through the notebooks; shell commands can easily be run with '!', and we can integrate in python code that graphs and explores the outputs.
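
(To give a flavor of what a cell looks like, here is a minimal, hypothetical example in the same spirit -- a '!' shell command followed by a bit of Python that plots the output. The file names and BLAST invocation below are placeholders, not the actual course notebooks:)

    # hypothetical notebook cell: run a shell command with '!', then explore the output in Python
    !blastall -p blastn -i query.fa -d database.fa -m 8 -o hits.txt

    # parse the tabular BLAST output (column 3 is percent identity) and plot it
    percent_identity = [float(line.split('\t')[2]) for line in open('hits.txt')]

    import matplotlib.pyplot as plt
    plt.hist(percent_identity, bins=20)
    plt.xlabel('percent identity')
    plt.ylabel('number of hits')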

Once we got past the basic teething pains of badly written notebooks, broken delivery mechanisms, proper ipython parameters, etc., things seemed to work really well. It's been great to be able to add code, annotate code, and graph stuff interactively!

Along the way, though, a few points have emerged.

First, ipython notebook adds a little bit of confusion to the process. Even though it's pretty simple, when you're throwing it in on top of UNIX, EC2, bioinformatics, and Python, people's minds tend to boggle.

For this reason, it's not yet clear how good an addition ipynb is to the course. We can't get away with replacing the shell with ipynb, for a variety of reasons; so it represents an extra cognitive burden. I think for an entire term course it will be an unambiguous win, but for an intensive workshop it may be one thing too many.

I should have a better feeling for this next week.

Second, in practice, ipython notebooks need to be written so that they can be executed multiple times on the same machine. Workshop attendees start out very confused about the order of commands vs the order of execution, and even though ipynb makes this relatively simple, if they get into trouble it is nice to be able to tell them to just rerun the entire notebook. So the notebook commands have to be designed this way -- for one example, if you're copying a file, make sure to use 'cp -f' so that it doesn't ask if the file needs to be copied again.
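
(For instance, a hypothetical setup cell written so that a second run is harmless:)

    # hypothetical setup cell, written so that re-running the whole notebook is safe
    !mkdir -p working                 # -p: no error if the directory already exists
    !cp -f /data/reads.fa working/    # -f: overwrite silently instead of asking
    !rm -f working/old-results.txt    # -f: no error if the file isn't there yet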

Third, in practice, ipython notebooks cannot contain long commands. If the entire notebook can't be re-run in about 1 minute, then it's too long. This became really clear with Oases and Trinity, where Oases could easily be run on a small data set in about 1-2 minutes, while Trinity took an hour or more. Neither people nor browsers handle that well. Moreover, if you accidentally run the time-consuming task twice, you're stuck waiting for it to finish, and it's annoying and confusing to put multi-execution guards on tasks.
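
(By "multi-execution guard" I mean something like the hypothetical cell below; it works, but it's exactly the kind of clutter you don't want to be explaining to a room of biologists:)

    import os

    # hypothetical guard: only launch the hours-long job if its output isn't already there
    if not os.path.exists('assembly/contigs.fa'):
        !sh run_long_assembly.sh      # placeholder for the expensive command
    else:
        print 'assembly output already present; skipping'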

This point is a known challenge with ipython notebook, of course; I've been talking with Fernando and Brian, among others, about how to deal with long running tasks. I'm converging to the idea that long-running tasks should be run at the command line (maybe using 'make' or something better?) and then ipython notebook can be used for data analysis leading to summaries and/or visualization.
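
(Concretely, the split I have in mind is: the assembly runs overnight at the shell, via make or whatever, and the notebook only does the cheap part afterwards -- something like this hypothetical summary cell, with a placeholder contig file:)

    # hypothetical post-assembly cell: the expensive assembly already ran at the
    # command line; here we just summarize the resulting contigs.
    lengths = []
    seq = []
    for line in open('assembly/contigs.fa'):      # placeholder path
        if line.startswith('>'):
            if seq:
                lengths.append(len(''.join(seq)))
            seq = []
        else:
            seq.append(line.strip())
    if seq:
        lengths.append(len(''.join(seq)))

    lengths.sort(reverse=True)
    total = sum(lengths)

    # N50: the contig length at which contigs this long or longer hold half the assembled bases
    running = 0
    n50 = 0
    for n50 in lengths:
        running += n50
        if running * 2 >= total:
            break

    print 'contigs:', len(lengths), 'total bp:', total, 'N50:', n50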

Fourth, ipython notebooks are a bit difficult to share in static form, which makes the site less useful. Right now I've been printing to HTML and then serving that HTML up statically, which is slow and not all that satisfying. There are probably easy solutions for this but I haven't invested in them ;).

---

In spite of these teething pains, feedback surrounding ipynb has been reasonably positive. Remember, these are biologists who may never have done any previous shell commands or programming, and we are throwing a lot at them; but overall the basic concepts of ipynb are simple, and they recognize that. Moreover, ipython notebook has enabled extra flexibility in what we present and make possible for them to do, and they seem to see and appreciate that.

The good news is that we figured all this out in the first week, and I still have a whole week with the guinea pigs, ahem, course attendees, under my thumb. We'll see how it goes!

--titus

p.s. Totally jonesing for a portfolio system that lets me specify a machine config, then with a single click spawns the machine, configures it, sucks down a bunch of ipython notebooks, and points me at the first one!

Syndicated 2012-06-08 18:17:00 from Titus Brown

Why I don't *really* practice open science

I'm a pretty big advocate of anything open -- open source, open access, and open science, in particular. I always have been. And now that I'm a professor, I've been trying to figure out how to actually practice open science effectively.

What is open science? Well, I think of it as talking regularly about my unpublished research on the Internet, generally in my blog or on some other persistent, explicitly public forum. It should be done regularly, and it should be done with a certain amount of depth or self-reflection. (See, for example, the wunnerful Rosie Redfield and Nature's commentary on her blogging of the arsenic debacle & tests thereof.)

Most of my cool, sexy bloggable work is in bioinformatics; I do have a wet lab, and we're starting to get some neat stuff out of that (incl. both some ascidian evo-devo and some chick transcriptomics) but that's not as mature as the computational stuff I'm doing. And, as you know if you've seen any of my recent posts on this, I'm pretty bullish about the computational work we've been doing: the de novo assembly sequence foo is, frankly, super awesome and seems to solve most of the scaling problems we face in short-read assembly. And it provides a path to solving the problems that it doesn't outright solve. (I'm talking about partitioning and digital normalization.)

While I think we're doing awesome work, I've been uncharacteristically (for me) shy about proselytizing it prior to having papers ready. I occasionally reference it on mailing lists, or in blog posts, or on twitter, but the most I've talked about the details has been in talks -- and I've rarely posted those talks online. When I have, I don't point out the nifty awesomeness in the talks, either, which of course means it goes mostly unnoticed. This seems to be at odds with my oft-loudly stated position that open-everything is the way to go. What's going on?? That's what this blog post is about. I think it sheds some interesting light on how science is actually practiced, and why completely open science might waste a lot of people's time.

I'd like to dedicate this blog post to Greg Wilson. He and I chat irregularly about research, and he's always seemed interested in what I'm doing but is stymied because I don't talk about it much in public venues. And he's been a bit curious about why. Which made me curious about why. Which led to this blog post, explaining why I think why. (I've had it written for a few months, but was waiting until I posted diginorm.)


For the past two years or so, I've been unusually focused on the problem of putting together vast amounts of data -- the problem of de novo assembly of short-read sequences. This is because I work on unusual critters -- soil microbes & non-model animals -- that nobody has sequenced before, and so we can't make use of prior work. We're working in two fields primarily, metagenomics (sampling populations of wild microbes) and mRNAseq (quantitative sequencing of transcriptomes, mostly from non-model organisms).

The problems in this area are manifold, but basically boil down to two linked issues: vast underlying diversity, and dealing with the even vaster amounts of sequence necessary to thoroughly sample this diversity. There's lots of biology motivating this, but the computational issues are, to first order, dominant: we can generate more sequence than we can assemble. This is the problem that we've basically solved.

A rough timeline of our work is:

  • mid/late 2009: Likit, a graduate student in my lab, points out that we're getting way better gene models from assembly of chick mRNAseq than from reference-based approaches. Motivates interest in assembly.
  • mid/late 2009: our lamprey collaborators deliver vast amounts of lamprey mRNAseq to us. Reference genome sucks. Motivates interest in assembly.
  • mid/late 2009: the JGI starts delivering ridiculous amount of soil sequencing data to us (specifically, Adina). We do everything possible to avoid assembly.
  • early 2010: we realize that the least insane approach to analyzing soil sequencing data relies on assembly.
  • early 2010: Qingpeng, a graduate student, convinces me that existing software for counting k-mers (tallymer, specifically) doesn't scale to samples with 20 or 30 billion unique k-mers. (He does this by repeatedly crashing our lab servers.)
  • mid-2010: a computational cabal within the lab (Jason, Adina, Rose) figures out how to count k-mers really efficiently, using a CountMin Sketch data structure (which we reinvent, BTW, but eventually figure out isn't novel. o well). We implement this in khmer. (see k-mer filtering) (A toy sketch of the counting idea appears just below this timeline.)
  • mid-2010: We use khmer to figure out just how much Illumina sequence sucks. (see Illumina read phenomenology)
  • mid-2010: Arend joins our computational cabal, bringing detailed and random knowledge of graph theory with him. We invent an actually novel use of Bloom filters for storing de Bruijn graphs. (blog post) The idea of partitioning large metagenomic data sets into (disconnected) components is born. (Not novel, as it turns out -- see MetaVelvet and MetaIDBA.)
  • late 2010: Adina and Rose figure out that Illumina suckage prevents us from actually getting this to work.
  • first half of 2011: Spent figuring out capacity of de Bruijn graph representation (Jason/Arend) and the right parameters to actually de-suckify large Illumina data sets (Adina). We slowly progress towards actually being able to partition large metagenomic data sets reliably. A friend browbeats me into applying the same technique to his ugly genomic data set, which magically seems to solve his assembly problems.
  • fall 2011: the idea of digital normalization is born: throwing away redundant data FTW. Early results are very promising (we throw away 95% of data, get identical assembly) but it doesn't scale assembly as well as I'd hoped.
  • October 2011: JGI talk at the metagenome informatics workshop - SLYT, where we present our ideas of partitioning and digital normalization, together, for the first time. We point out that this potentially solves all the scaling problems.
  • November 2011: We figure out the right parameters for digital normalization, turning up the awesomeness level dramatically.
  • through present: focus on actually writing this stuff up. See: de Bruijn graph preprint; digital normalization preprint.
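
(For the curious, here is the toy illustration of that k-mer counting idea promised above: hash each k-mer once, increment one cell in each of several prime-sized tables, and report the minimum over tables as the count estimate. This is a from-scratch sketch, not the khmer code, and the tables are absurdly small so it fits here.)

    # Toy CountMin-Sketch-style k-mer counter -- illustration only, not khmer.
    # Real table sizes are in the billions of entries, not ~1000.
    TABLE_SIZES = [1009, 1013, 1019]          # a few small primes
    tables = [[0] * size for size in TABLE_SIZES]
    K = 20

    def add_kmer(kmer):
        h = hash(kmer)
        for table, size in zip(tables, TABLE_SIZES):
            table[h % size] += 1

    def get_count(kmer):
        # each table can only overcount (hash collisions), so the minimum is the estimate
        h = hash(kmer)
        return min(table[h % size] for table, size in zip(tables, TABLE_SIZES))

    def consume_sequence(seq):
        for i in range(len(seq) - K + 1):
            add_kmer(seq[i:i+K])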

If you read this timeline (yeah, I know it's long, just skim) and look at the dates of "public disclosure", there's a 12-14 month gap between talking about k-mer counting (July 2010) and partitioning/etc (Oct 2011, metagenome informatics talk). And then there's another several-month gap before I really talk about digital normalization as a good solution (basically, mid/late January 2012).

Why??

  1. I was really freakin' busy actually getting the stuff to work, not to mention teaching, traveling, and every now and then actually being at home.
  2. I was definitely worried about "theft" of ideas. Looking back, this seems a mite ridiculous, but: I'm junior faculty in a fast-moving field. Eeek! I also have a duty to my grads and postdocs to get them published, which wouldn't be helped by being "scooped".
  3. We kept on coming up with new solutions and approaches! Digital normalization didn't exist until August 2011, for example; appropriate de-suckifying of Illumina data took until April or May of 2011; and proving that it all worked was, frankly, quite tough and took until October. (More on this below.)
  4. The code wasn't ready to use, and we hadn't worked out all the right parameters, and I wasn't ready to do the support necessary to address lots of people using the software.

All of these things meant I didn't talk about things openly on my blog. Is this me falling short of "open science" ideals??

In my defense, on the "open science" side:

  • I gave plenty of invited talks in this period, including a few (one at JGI and one at UMD CBCB) attended by experts who certainly understood everything I was saying, probably better than me.
  • I posted some of these talks on slideshare.
  • all of our software development has been done on github, under github.com/ctb/khmer/. It's all open source, available, etc.

...but these are sad excuses for open science. None of these activities really disseminated my research openly. Why?

Well, invited talks by junior faculty like me are largely attended out of curiosity and habit, rather than out of a burning desire to understand what they're doing; odds are, the faculty in question hasn't done anything particularly neat, because if they had, they'd be well known/senior, right? And who the heck goes through other people's random presentations on slideshare? So that's not really dissemination, especially when the talks are given to an in group already.

What about the source code? The "but all my source code is available" dodge is particularly pernicious. Nobody, but nobody, looks at other people's source code in science, unless it's (a) been released, (b) been documented, and (c) claims to solve YOUR EXACT ACTUAL PROBLEM RIGHT NOW RIGHT NOW. The idea that someone is going to come along and swoop your awesome solution out of your repository seems to me to be ridiculous; you'd be lucky to be that relevant, frankly.

So I don't think any of that is a good way to disseminate what you've done. It's necessary for science, but it's not at all sufficient.

--

What do I think is sufficient for dissemination? In my case, how do you build solutions and write software that actually has an impact, either on the way people think or (even better) on actual practice? And is it compatible with open science?

  1. Write effective solutions to common problems. The code doesn't have to be pretty or even work all that well, but it needs to work well enough to run and solve a common problem.
  2. Make your software available. Duh. It doesn't have to be open source, as far as I can tell; I think it should be, but plenty of people have restrictive licenses on their code and software, and it gets used.
  3. Write about it in an open setting. Blogs and mailing lists are ok; SeqAnswers is probably a good place for my field; but honestly, you've got to write it all down in a nice, coherent, well-thought out body of text. And if you're doing that? You might as well publish it. Here is where Open Access really helps, because The Google will make it possible for people to find it, read it, and then go out and find your software.

The interesting thing about this list is that in addition to all the less-than-salutary reasons (given above) for not blogging more regularly about our stuff, I had one very good reason for not doing so.

It's a combination of #1 and #3.

You see, until shortly before the metagenome informatics meeting, I didn't know if partitioning or digital normalization really worked. We had really good indications that partitioning worked, but it was never solid enough for me to push it strongly as an actual solution to big data problems. And digital normalization made so much sense that it almost had to work, but, um, proving it was a different problem. Only in October did we do a bunch of cross-validation that basically convinced me that partitioning worked really well, and only in November did we figure out how awesome digital normalization was.

So we thought we had solutions, but we weren't sure they were effective, and we sure didn't have it neatly wrapped in a bow for other people to use. So #1 wasn't satisfied.

And, once we did have it working, we started to put a lot of energy into demonstrating that it worked and writing it up for publication -- #3 -- which took a few months.

In fact, I would actually argue that before October 2011, we could have wasted people's time by pushing our solutions out for general use when we basically didn't know if they worked well. Again, we thought they did, but we didn't really know.

This is a conundrum for open science: how do you know that someone else's work is worth your attention? Research is really hard, and it may take months or years to nail down all the details; do you really want to invest significant time or effort in someone else's research before that's done? And when they are done -- well, that's when they submit it for publication, so you might as well just read that first!

--

This is basically the format for open science I'm evolving. I'll blog as I see fit, I'll post code and interact with people that I know who need solutions, but I will wait until we have written a paper to really open up about what we're doing. A big part of that is trying to only push solid science, such that I don't waste other people's time, energy, and attention.

So: I'm planning to continue to post all my senior-author papers to arXiv just before their first submission. The papers will come with open source and the full set of data necessary to recapitulate our results. And I'll blog about the papers, and the code, and the work, and try to convince people that it's nifty and awesome and solves some useful problems, or addresses cool science. But I don't see much point in broadly discussing my stuff before a preprint is available.

Is this open science? I don't really think so. I'd really like to talk more openly about our actual research, but for all the reasons above, it doesn't seem like a good idea. So I'll stick to trying to give presentations on our stuff at conferences, and maybe posting the presentations to slideshare when I think of it, and interacting with people privately where I can understand what problems they're running into.

What I'm doing is more about open access than open science: people won't find out details of our work until I think it's ready for publication, but they also won't have to wait for the review process to finish. While I'm not a huge fan of the way peer review is done, I accept it as a necessary evil for getting my papers into a journal. By the time I submit a paper, I'll be prepared to argue, confidently and with actual evidence, that the approach is sound. If the reviewers disagree with me and find an actual mistake, I'll fix the paper and apologize profusely & publicly; if reviewers just want more experiments done to round out the story, I'll do 'em, but it's easy to argue that additional experiments generally don't detract from the paper unless they discover flaws (see above, re "apologize"). The main thing reviewers seem to care about is softening grandiose claims, anyway; this can be dealt with by (a) not making them and (b) sending to impact-oblivious journals like PLoS One. I see no problem with posting the paper, in any of these circumstances.

Maybe I'm wrong; experience will tell if this is a good idea. It'll be interesting to see where I am once we get these papers out... which may take a year or two, given all the stuff we are writing up.

I've also come to realize that most people don't have the time or (mental) energy to spare to really come to grips with other people's research. We were doing some pretty weird stuff (sketch graph representations? streaming sketch algorithms for throwing away data?), and I don't have a prior body of work in this area; most people probably wouldn't be able to guess at whether I was a quack without really reading through my code and presentations, and understanding it in depth. That takes a lot of effort. And most people don't really understand the underlying issues anyway; those who do probably care about them sufficiently to have their own research ideas and are pursuing them instead, and don't have time to understand mine. The rest just want a solution that runs and isn't obviously wrong.

In the medium term, the best I can hope for is that preprints and blog posts will spur people to either use our software and approaches, or that -- even better -- they will come up with nifty new approaches that solve the problems in some new way that I'd never have thought of. And then I can read their work and build on their ideas. This is what we should strive for in science: the shortest round trip between solid scientific inspiration in different labs. This does not necessarily mean open notebooks.

Overall, it's been an interesting personal journey from "blind optimism" about openness to a more, ahem, "nuanced" set of thoughts (i.e., I was wrong before :). I'd be interested to hear what other people have to say... drop me a note or make a comment.

--titus

p.s. I recognize that it's too early to really defend the claim that our stuff provides a broad set of solutions. That's not up to me to say, for one thing. For another, it'll take years to prove out. So I'm really talking about the hypothetical solution where it is widely useful in practice, and how that intersects with open science goals & practice.

Syndicated 2012-04-08 00:07:59 from Titus Brown

Big Data Biology - why efficiency matters

I'm going to pick on Mick Watson today. (It's OK. He's just a foil for this discussion, and I hope he doesn't take it too personally.)

Mick made the following comment on my earlier Big Data Biology blog post:

I do wonder whether there is just a bit too much hand wringing about "big data".

For e.g., the rumen metagenomic data you mentioned above, I can assemble using MetaVelvet on our server in less than a day (admittedly it has 512Gb of RAM, but doesn't everyone?). I can count the 17mers in it using Jellyfish in a few hours.

So I just set the processes running, two days later, I have my analysis. What's the problem? Does it matter that you can do it quicker?

Big data doesn't really worry me.

...

I know I am being flippant, but really to me the challenge isn't the data, it's the biology. I don't care if it takes 2 hours, 2 days or 2 weeks to process the data.

Improve your computing efficiency by 100x, I don't care; improve your ability to extract biological information by 100x, then I'm interested :)

He makes one very, very, very good point -- who cares if you can run an analysis (whatever it is) and it doesn't provide any value? The end goal of my sequencing analysis is to provide insight into biological processes; I might as well just delete the data (an O(1) "analysis" operation, if one with a big constant in front of it..) if the analysis isn't going to yield useful information.

But he also seems to think that speed and efficiency of analyses doesn't matter for science. And I don't just think he's dead wrong, I know he's dead wrong.

This is both an academic point and a practical point. And, in fact, an algorithmic point.

The academic reason why efficient computation is good for science

The academic point is simple: our ability to do thorough exploratory analysis of a large sequencing data set is limited by at least four things. These four things are:

  1. Our ability to do initial processing on the data - error trimming and correction, and data summary (mapping and assembly, basically).

  2. The information available for cross-reference. Most (99.9%) of our bioinformatic analyses rely on homology (for inference of function) and annotation.

    (This is why Open Access of data is so freakin' important to us bioinformaticians. If you hide your database from us, it might as well not exist for all we care.)

  3. Statistics. We do a lot of sensitive signal analysis and multiple testing, and we are really quite bad at computing FDRs and other statistical properties. Each statistical advance is greeted with joy.

  4. The ability to complete computations on (1), (2), and (3).

Every 100gb data set takes a day to process. Mapping and assembly can take hours to days to weeks. Each database search costs time and effort (in part because the annotations are all in different formats). Each MCMC simulation or background calculation takes significant time, even if it's automated.

Inefficient computation thus translates to an economic penalty on science (in time, effort, and attention span). This, in turn, leads directly to science that is not as good as it could be (as do poor computational science skills, badly written software, inflexible workflows, opaque pipelines, and too quick a rush to hypotheses -- hey, look, a central theme to my blog posts!)

Anecdote: someone recently e-mailed us to tell us about how they could assemble a comparable soil data set to ours in a mere week and 3 TB of memory. Our internal estimates suggest that for full sensitivity, we need to do 5-10 assemblies of that data set (each with different parameters) followed by a similarly expensive post-assembly merging -- so, minimally, 6 weeks of compute requiring 3 TB of memory, full-time, on as many cores as possible. You've gotta imagine that there's going to be a lot of internal pressure to get results in less time (surely we can get away with only 1 assembly?) with less parameter searching (what, you think we can tell you which parameters are going to work?) and this pressure is going to translate to doing less in the way of data set exploration. (Never mind the actual economics -- since this data set would take about 1 week of sequencer time, and $10,000 or so, to generate today, I think they don't make sense either.)

I can point you to at least three big metagenome Illumina assembly papers where I know these computational limitations truncated their exploration of the data set. (Wait, you say there are only three? Well, I'm not going to tell you which three they are.)

The practical reason why efficient computation is good for science

This one's a bit more obvious, but, interestingly, Mick also treads all over it. He says "...I can assemble using MetaVelvet on our server in less than a day (admittedly it has 512 Gb of RAM, but doesn't everyone?)"

Well, no, they don't.

We didn't have access to such a big server until recently. We had plenty of offers for occasional access, but when we explained that we needed them for a few weeks of dedicated compute (for parameter exploration -- see above) and also that no, we weren't willing to sign copyright or license for our software over to a national lab for that access, somewhat oddly a lot of the offers came to naught.

It turns out most people don't have access to such bigmem computers, or even big compute clusters; and when they do, those computers and clusters aren't configured for biologists to use.

Democratization of sequencing should mean democratization of analysis, too. Every year our next-gen sequence analysis course gets tons of applicants from small colleges and universities where the compute infrastructure is small and what does exist is overwhelmed by Monte Carlo calculations. Our course explicitly teaches them to use Amazon to do their compute -- with that, they can take that knowledge home, and spend small amounts of money to buy IaaS, or apply for an AWS education grant to do their analysis. We feel for them because we were in their situation until recently.

Expensive compute translates to a penalty on the very ability of many scientists and teachers to access computational science. (Insert snide comment on similar limitations in practical access to US education, health care, and justice).

The algorithmic reason why efficient computation is good for science

Assemblers kinda suck. Everyone knows it, and recent contests & papers have done a pretty good job of highlighting the limitations (see GAGE and Assemblathon). This is not because the field is full of stupid people, but rather because assembly is a really, really hard problem (see Nagarajan & Pop) -- so hard that really smart people have worked for decades on it. (In many ways, the fact that it works at all is a tribute to their brilliance.)

Advances in assembly algorithms have led to our current crop of assemblers, but assemblers are still relatively slow and relatively memory consumptive. Our diginorm paper benchmarks Trinity as requiring 38 hours in 42gb of RAM for 100m mouse mRNAseq reads; genome and metagenome assemblers require similar size resources, although the variance depends on the sample, of course. SGA and Cortex seem unreasonably memory efficient to me :), but I understand that they perform less well on things other than single genomes (like, say, metagenomic data) -- in part because the underlying data structures are targeted at specific features of their data.

What's the plan for the future, in which we will be applying next-gen sequencing to non-model organisms, evolutionary experiments, and entire populations of novel critters? These sequencing data sets will have different features from the ones we are used to tackling with current tech -- including higher heterozygosity and strong GC-rich biases.

I personally think the next big advances in assembly will come through the systematic application of sample- or sub-sample specific, compute-expensive algorithms like EMIRGE to our data sets. While perfect assembly may be a pipe dream, significant and useful incremental advances seem very achievable, especially if the practical cost of current assembly algorithms drops.

Not so parenthetically, this is one of the reasons I'm so excited about digital normalization (the general concept, not only our implementation) --

I bet more algorithmically expensive solutions would be investigated, implemented, and applied if memory and time requirements dropped, don't you?

Or if the data could be made less error-prone and simpler?

Or if the volume of data could be reduced without losing much information?

I will take one side of that bet...

---

Of course, I'm more than a wee bit biased on this whole topic. A big focus of my group has been in spending the last three years fighting the trend of "just use a bigger computer and it will all be OK". Diginorm and partitioning are two of the results, and a few more will be emerging soon. I happen to think it's incredibly important; I would have done something else with my time, energy, and money if not. Hopefully you can agree that it's important, even if you're interested in other things.

So: yes, computational efficiency is not the only thing. And it's a surprisingly convenient moving target; frequently, you yourself can just wait a few months or buy a bigger computer, and achieve similar results. But sometimes that attitude masks the fact that efficient computation can bring better, cheaper, and broader science. We need to pay attention to that, too.

And, Mick? I don't think I can improve your ability to extract biological information by 100x. On metagenomes, would 2-10x be a good enough start?

--titus

Syndicated 2012-04-06 13:36:38 from Titus Brown

What is digital normalization, anyway?

I'm out at a Cloud Computing for the Human Microbiome Workshop and I've been trying to convince people of the importance of digital normalization. When I posted the paper the reaction was reasonably positive, but I haven't had much luck explaining why it's so awesome.

At the workshop, people were still confused. So I tried something new.

I first made a simulated metagenome by taking three genomes worth of data from the Chitsaz et al. (2011) paper (see http://bix.ucsd.edu/projects/singlecell/) and shuffling them together. I combined the sequences in a ratio of 10:25:50 for the E. coli sequences, the Staph sequences, and the SAR sequences, respectively; the latter two were single-cell MDA genomic DNA. I took the first 10m reads of this mix and then estimated the coverage.
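
(The mixing itself is nothing fancy -- conceptually it's just weighted sampling of reads from the three sources. Here's a rough sketch of one way to do it, with placeholder file names and one-read-per-line files standing in for real FASTA/FASTQ parsing; it's not the exact script I used.)

    import random

    # rough sketch: draw reads from three read sets in a 10:25:50 ratio
    # (placeholder file names; one read per line for simplicity)
    sources = [(open('ecoli_reads.txt'), 10),
               (open('staph_reads.txt'), 25),
               (open('sar_reads.txt'), 50)]
    total_weight = float(sum(weight for _, weight in sources))

    output = open('simulated_mix.txt', 'w')
    n_taken = 0
    while n_taken < 10000000:                  # keep the first 10m reads of the mix
        # pick a source with probability proportional to its weight
        r = random.random() * total_weight
        for fp, weight in sources:
            r -= weight
            if r <= 0:
                break
        read = fp.readline()
        if not read:                           # that source ran out of reads
            break
        output.write(read)
        n_taken += 1
    output.close()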

You can see the coverage of these genomic data sets estimated by using the known reference sequences in the first figure. E. coli looks nice and Gaussian; Staph is smeared from here to heck; and much of the SAR sequence is low coverage. This reflects the realities of single cell sequencing: you get really weird copy number biases out of multiple displacement amplification.

Then I applied three-pass digital normalization (see the paper) and plotted the new abundances. As a reminder, this operates without knowing the reference in advance; we're just using the known reference here to check the effects.

http://ivory.idyll.org/permanent/raw-coverage.png

Coverage of genome read mix, calculated by mapping the mixed reads onto the known reference genomes.

http://ivory.idyll.org/permanent/norm-coverage.png

Coverage post-digital-normalization, again calculated by mapping the mixed reads onto the known reference genomes.

As you can see, digital normalization literally "normalizes" the data to the best of its ability. That is, it cannot create higher coverage where high coverage doesn't exist (for the SAR), but it can convert the existing high coverage into nice, Gaussian distributions centered around a much lower number. You also discard quite a bit of data (look at the X axes -- about 85% of the reads were discarded in downsampling the coverage like this).

When you assemble this, you get results as good as or better than assembling the unnormalized data, despite having discarded so much of it. This is because no low-coverage data is discarded, so you still cover as many bases overall -- just with fewer reads. To boot, it works pretty generically for single genomes, MDA genomes, transcriptomes, and metagenomes.

And, as a reminder? Digital normalization does this in fixed, low memory; in a single pass; and without any reference sequence needed.
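
(If you want the gist in code: below is a stripped-down, single-pass sketch of the idea -- keep a read only if the median count of its k-mers, so far, is below a fixed coverage cutoff. This is an illustration, not the khmer implementation; in particular, the real thing stores counts in a fixed-memory CountMin Sketch rather than a Python dict, which is where the "fixed, low memory" part comes from.)

    # stripped-down, single-pass sketch of digital normalization (not the khmer code)
    K = 20
    COVERAGE_CUTOFF = 20
    counts = {}                    # khmer uses a fixed-memory CountMin Sketch instead

    def kmers(seq):
        for i in range(len(seq) - K + 1):
            yield seq[i:i+K]

    def median(values):
        values = sorted(values)
        return values[len(values) // 2]

    def keep_read(seq):
        kmer_counts = [counts.get(k, 0) for k in kmers(seq)]
        if not kmer_counts:                    # read shorter than K: just keep it
            return True
        if median(kmer_counts) >= COVERAGE_CUTOFF:
            return False                       # we've seen this region plenty already: discard
        for k in kmers(seq):                   # otherwise, count its k-mers and keep it
            counts[k] = counts.get(k, 0) + 1
        return True

    # usage: stream reads through keep_read(), writing out only the reads it keeps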

Pretty neat.

--titus

Syndicated 2012-04-06 11:17:51 from Titus Brown

Our approach to replication in computational science

I'm pretty proud of our most recently posted paper, which is on a sequence analysis concept we call digital normalization. I think the paper is pretty kick-ass, but so is the way in which we're approaching replication. This blog post is about the latter.

(Quick note re "replication" vs "reproduction": The distinction between replication and reproducibility is, from what I understand, that "replicable" means "other people get exactly the same results when doing exactly the same thing", while "reproducible" means "something similar happens in other people's hands". The latter is far stronger, in general, because it indicates that your results are not merely some quirk of your setup and may actually be right.)

So what did we do to make this paper extra super replicable?

If you go to the paper Web site, you'll find:

  • a link to the paper itself, in preprint form, stored at the arXiv site;
  • a tutorial for running the software on a Linux machine hosted in the Amazon cloud;
  • a git repository for the software itself (hosted on github);
  • a git repository for the LaTeX paper and analysis scripts (also hosted on github), including an ipython notebook for generating the figures (more about that in my next blog post);
  • instructions on how to start up an EC2 cloud instance, install the software and paper pipeline, and build most of the analyses and all of the figures from scratch;
  • the data necessary to run the pipeline;
  • some of the output data discussed in the paper.

(Whew, it makes me a little tired just to type all that...)

What this means is that you can regenerate substantial amounts (but not all) of the data and analyses underlying the paper from scratch, all on your own, on a machine that you can rent for something like 50 cents an hour. (It'll cost you about $4 -- 8 hours of CPU -- to re-run everything, plus some incidental costs for things like downloads.)

Not only can you do this, but if you try it, it will actually work. I've done my best to make sure the darn thing works, and this is the actual pipeline we ourselves ran to produce the figures in the paper. All the data is there, and all of the code used to process the data, analyze the results, and produce the figures is also there. In version control.

When you combine that with the ability to run this on a specific EC2 instance -- a combination of a frozen virtual machine installation and a specific set of hardware -- I feel pretty confident that at least this component of our paper is something that can be replicated.

A few thoughts on replicability, and effort

Why did I go to all this trouble??

Wasn't it a lot of work?

Well, interestingly enough, it wasn't that much work. I already use version control for everything, including paper text; posting it all to github was a matter of about three commands.

Writing the code, analysis scripts, and paper was an immense amount of work. But I had to do that anyway.

The most extra effort I put in was making sure that the big data files were available. I didn't want to add the 2gb E. coli resequencing data set to git, for example. So I ended up tarballing those files and sticking them on S3.

The Makefile and analysis scripts are ugly, but suffice to remake everything from scratch; they were already needed to make the paper, so in order to post them all I had to do was put in a teensy bit of effort to remove some unintentional dependencies.

The ipython notebook used to generate the figures (again -- next blog post) was probably the most effort, because I had to learn how to use it, which took about 20 minutes. But it was one of the smoothest transitions into using a new tool I've ever experienced in my ~25 years of coding.

Overall, it wasn't that much extra effort on my part.

Why bother in the first place??

The first and shortest answer is, because I could, and because I believe in replication and reproducibility, and wanted to see how tough it was to actually do something like this. (It's a good deal above and beyond what most bioinformaticians do.)

Perhaps the strongest reason is that our group has been bitten a lot in recent months by irreplicable results. I won't name names, but several Science and PNAS and PLoS One papers of interest to us turned out to be basically impossible for us to replicate. And, since we are engaged in developing new computational methods that must be compared to previous work, an inability to regenerate exactly the results in those other papers meant we had to work harder than we should have, simply to reproduce what they'd done.

A number of these problems came from people discarding large data sets after publishing, under the mistaken belief that their submission to the Short Read Archive could be used to regenerate their results. (Often SRA submissions are unfiltered, and no one keeps the filtering parameters around...right?) In some cases, I got the right data sets from the authors and could replicate (kudos to Brian Haas of Trinity for this!), but in most cases, ixnay on the eplicationre.

Then there were the cases where authors clearly were simply being bad computational scientists. My favorite example is a very high profile paper (coauthored by someone I admire greatly), in which the script they sent to us -- a script necessary for the initial analyses -- had a syntax error in it. In that case, we were fairly sure that the authors weren't sending us the script they'd actually used... (It was Perl, so admittedly it's hard to tell a syntax error from legitimate code, but even the Perl interpreter was choking on this.)

(A few replication problems came from people using closed or unpublished software, or being hand-wavy about the parameters they used, or using version X of some Web-hosted pipeline for which only version Y was now available. Clearly these are long-term issues that need to be discussed with respect to replication in comp. bio., but that's another topic.)

Thus, my group has wasted a lot of time replicating other people's work. I wanted to avoid making other people go through that.

A third reason is that I really, really, really want to make it easy for people to pick up this tool and use it. Digital normalization is super ultra awesome and I want as little as possible to stand in the way of others using it. So there's a strong element of self-interest in doing things this way, and I hope it makes diginorm more useful. (I know about a dozen people that have already tried it out in the week or so since I made the paper available, which is pretty cool. But citations will tell.)

What use is replication?

Way back when, Jim Graham politely schooled me in the true meaning of reproducibility, as opposed to replication. He was about 2/3 right, but then he went a bit too far and said

But let's drop the idea that I'm going to take your data and your code and "reproduce" your result. I'm not. First, I've got my own work to do. More importantly, the odds are that nobody will be any wiser when I'm done.

Well, let's take a look at that concern, shall we?

With the benefit of about two years of further practice, I can tell you this is a dangerously wrong way to think, at least in the field of bioinformatics. My objections hinge on a few points:

First, based on our experiences so far, I'd be surprised if the authors themselves could replicate their own computational results -- too many files and parameters are missing. We call that "bad science".

Second, odds are, the senior professor has little or no detailed understanding of what bioinformatic steps were taken in processing the data, and moreover is uninterested in the details; that's why they're not in the Methods. Why is that a problem? Because the odds are quite good that many biological analyses hinge critically on such points. So the peer reviewers and the community at large need to be able to evaluate them (see this RNA editing kerfuffle for an excellent example of reviewer fail). Yet most bioinformatic pipelines are so terribly described that even with some WAG I can't figure out what, roughly speaking, is going on. I certainly couldn't replicate it, and generating specific critiques is quite difficult in that kind of circumstance.

Parenthetically, Graham does refer to the climate sciences' struggles with reproducibility and replication. If only they put the same effort into replication and data archiving as they did into arguing with climate change deniers...

Third, Graham may be guilty of physics chauvinism (just like I'm almost certainly guilty of bioinformatics chauvinism...) Physics and biology are quite different: in physics, you often have a theoretical framework to go by, and results should at least roughly adhere to that or else they are considered guilty until proven innocent. In biology, we usually have no good idea of what we're expecting to see, and often we're looking at a system for the very first time. In that environment, I think it's important to make the underlying computation WAY more solid than you would demand in physics (see RNA editing above).

As Narayan Desai pointed out to me (a point I then put into my PyCon talk, slide 5), physics and biology are quite different in the way data is generated and analyzed. There are fewer sources of data generation in physics, more of a computational culture, and more theory. Having worked with physicists for much of my scientific life (and having published a number of papers with physicists), I can tell you that replication is certainly a big problem over there, but the consequences don't seem as big -- eventually the differences between theory and computation will be worked out, because they're far more noticeable when you have theory, like in physics. Not so in biology.

Fourth, a renewed emphasis on computational methods (and therefore on replicability of computational results) is a natural part of the transition to Big Data biology. The quality of analysis methods matters A LOT when you are dealing with massive data sets with weak signals and many systematic biases. (I'll write about this more later.)

Fifth, and probably most significant from a practical perspective, Graham misses the point of reuse. In bioinformatics, it behooves us to reuse proven (aka published) tools -- at least we know they worked for someone, at least once, which is not usually the case for newly written software. I don't pretend that it's the responsibility of people to write awesome reusable tools for every paper, but sure as heck I should expect to be able to run them on some combination of hardware and software. Often that's not the case, which means I get to reinvent the wheel (yay...) even when I'm doing the same stupid thing the last five pubs did.

For our paper, khmer and screed should be quite reusable. The analysis pipeline for the paper? It's not that great. But at least you can run it, and potentially steal code from it, too.

When I was talking to a colleague about the diginorm paper, he said something jokingly: "wow, you're making it way too easy for people!" -- presumably he meant it would be way too easy for people to criticize or otherwise complain about the specific way we're doing things. Then, a day or two later he said, "hmm, but now that I think of it, no one ever uses the software we publish, and you seem to have had better luck with that..." -- recognizing that if you are barely able to run your own software, perhaps others might find it even more difficult.

Heck, the diginorm paper itself would have been far harder to write without the data sets from the Trinity paper and the Velvet-SC paper. Having those nice, fresh, well-analyzed data sets already at hand was fantastic. Being able to run Trinity and reproduce their results was wonderful.

There's a saying in software engineering: "one of the main people you should be programming for is yourself, in 6 months." That's also true in science -- I'm sure I won't remember the finer details of the diginorm paper analysis in 2 years -- but I can always go look into version control. More importantly, new graduate students can go look and really see what's going on. (And I can use it for teaching, too.) And so can other people working with me. So there's a lot of utility in simply nailing everything down and making it runnable.

Replication is by no means sufficient for good science. But I'll be more impressed by the argument that "replication isn't all that important" when I see lack of replication as the exception rather than the rule. Replication is essential, and good, and useful. I long for the day when it's not interesting, because it's so standard. In the meantime I would argue that it certainly doesn't do any harm to emphasize it.

(Note that I really appreciate Jim Graham's commentary, as I think he is at worst usefully wrong on these points, and substantially correct in many ways. I'm just picking on him because he wrote it all down in one place for me to link to, and chose to use the word 'sic' when reproducing my spelling mistake. Low blow ;)

The future

I don't pretend to have all, or even many, of the answers; I just like to think about what form they might take.

I don't want to argue that this approach is a panacea or a high-quality template for others to use, inside or out of bioinformatics. For one thing, I haven't automated some of the analyses in the paper; it's just too much work for too little benefit at this point. (Trust me, they're easy to reproduce... :). For another, our paper used a fairly small amount of data overall; only a few dozen gigabytes all told. This makes it easy to post the data for others to use later on. Several of our next few papers will involve over a half terabyte of raw data, plus several hundred gb of ancillary and intermediate results; no idea what we'll do for them.

Diginorm is also a somewhat strange bioinformatics paper. We just analyzed other people's data sets (an approach which for some reason isn't in favor in high impact bioinformatics, probably because high impact journal subs are primarily reviewed by biologists who want to see cool new data that we don't understand, not boring old data that we don't understand). There's no way we can or should argue that biological replicates done in a different lab should replicate the results; that's where reproducibility becomes important.

But I would like it if people considered this approach (or some other approach) to making their analyses replicable. I don't mind people rejecting good approaches because they don't fit; to each their own. But this kind of limited enabling of replication isn't that difficult, frankly, and even if it were, it has plenty of upsides. It's definitely not irrelevant to the practice of science -- I would challenge anyone to try to make that claim in good faith.

--titus

p.s. I think I have to refer to this cancer results not reproducible paper somewhere. Done.

Syndicated 2012-04-02 14:29:39 from Titus Brown

Our approach to replication in computational science

I'm pretty proud of our most recently posted paper, which is on a sequence analysis concept we call digital normalization. I think the paper is pretty kick-ass, but so is the way in which we're approaching replication. This blog post is about the latter.

(Quick note re "replication" vs "reproduction": The distinction between replication and reproducibility is, from what I understand, that "replicable" means "other people get exactly the same results when doing exactly the same thing", while "reproducible" means "something similar happens in other people's hands". The latter is far stronger, in general, because it indicates that your results are not merely some quirk of your setup and may actually be right.)

So what did we do to make this paper extra super replicable?

If you go to the paper Web site, you'll find:

  • a link to the paper itself, in preprint form, stored at the arXiv site;
  • a tutorial for running the software on a Linux machine hosted in the Amazon cloud;
  • a git repository for the software itself (hosted on github);
  • a git repository for the LaTeX paper and analysis scripts (also hosted on github), including an ipython notebook for generating the figures (more about that in my next blog post);
  • instructions on how to start up an EC2 cloud instance, install the software and paper pipeline, and build most of the analyses and all of the figures from scratch;
  • the data necessary to run the pipeline;
  • some of the output data discussed in the paper.

(Whew, it makes me a little tired just to type all that...)

What this means is that you can regenerate substantial amounts (but not all) of the data and analyses underlying the paper from scratch, all on your own, on a machine that you can rent for something like 50 cents an hour. (It'll cost you about $4 -- 8 hours of CPU -- to re-run everything, plus some incidental costs for things like downloads.)

Not only can you do this, but if you try it, it will actually work. I've done my best to make sure the darn thing works, and this is the actual pipeline we ourselves ran to produce the figures in the paper. All the data is there, and all of the code used to process the data, analyze the results, and produce the figures is also there. In version control.

When you combine that with the ability to run this on a specific EC2 instance -- a combination of a frozen virtual machine installation and a specific set of hardware -- I feel pretty confident that at least this component of our paper is something that can be replicated.

A few thoughts on replicability, and effort

Why did I go to all this trouble??

Wasn't it a lot of work?

Well, interestingly enough, it wasn't that much work. I already use version control for everything, including paper text; posting it all to github was a matter of about three commands.

Writing the code, analysis scripts, and paper was an immense amount of work. But I had to do that anyway.

The most extra effort I put in was making sure that the big data files were available. I didn't want to add the the 2gb E. coli resequencing data set to git, for example. So I ended up tarballing those files sticking them on S3.

The Makefile and analysis scripts are ugly, but suffice to remake everything from scratch; they were already needed to make the paper, so in order to post them all I had to do was put in a teensy bit of effort to remove some unintentional dependencies.

The ipython notebook used to generate the figures (again -- next blog post) was probably the most effort, because I had to learn how to use it, which took about 20 minutes. But it was one of the smoothest transitions into using a new tool I've ever experienced in my ~25 years of coding.
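For flavor, here's the sort of cell the notebook contains: load an output file from the pipeline and turn it into a figure. The file name and column layout here are hypothetical stand-ins for the pipeline's actual output:

    import numpy
    from matplotlib import pyplot

    # column 0 = k-mer coverage, column 1 = number of reads at that coverage
    data = numpy.loadtxt('coverage-histogram.txt')

    pyplot.plot(data[:, 0], data[:, 1])
    pyplot.xlabel('k-mer coverage')
    pyplot.ylabel('number of reads')
    pyplot.title('read coverage distribution')
    pyplot.savefig('coverage-figure.pdf')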

Overall, it wasn't that much extra effort on my part.

Why bother in the first place??

The first and shortest answer is, because I could, and because I believe in replication and reproducibility, and wanted to see how tough it was to actually do something like this. (It's a good deal above and beyond what most bioinformaticians do.)

Perhaps the strongest reason is that our group has been bitten a lot in recent months by irreplicable results. I won't name names, but several Science and PNAS and PLoS One papers of interest to us turned out to be basically impossible for us to replicate. And, since we are engaged in developing new computational methods that must be compared to previous work, an inability to regenerate exactly the results in those other papers meant we had to work harder than we should have, simply to reproduce what they'd done.

A number of these problems came from people discarding large data sets after publishing, under the mistaken belief that their submission to the Short Read Archive could be used to regenerate their results. (Often SRA submissions are unfiltered, and no one keeps the filtering parameters around...right?) In some cases, I got the right data sets from the authors and could replicate (kudos to Brian Haas of Trinity for this!), but in most cases, ixnay on the eplicationre.

Then there were the cases where authors clearly were simply being bad computational scientists. My favorite example is a very high profile paper (coauthored by someone I admire greatly), in which the script they sent to us -- a script necessary for the initial analyses -- had a syntax error in it. In that case, we were fairly sure that the authors weren't sending us the script they'd actually used... (It was Perl, so admittedly it's hard to tell a syntax error from legitimate code, but even the Perl interpreter was choking on this.)

(A few replication problems came from people using closed or unpublished software, or being hand-wavy about the parameters they used, or using version X of some Web-hosted pipeline for which only version Y was now available. Clearly these are long-term issues that need to be discussed with respect to replication in comp. bio., but that's another topic.)

Thus, my group has wasted a lot of time replicating other people's work. I wanted to avoid making other people go through that.

A third reason is that I really, really, really want to make it easy for people to pick up this tool and use it. Digital normalization is super ultra awesome and I want as little as possible to stand in the way of others using it. So there's a strong element of self-interest in doing things this way, and I hope it makes diginorm more useful. (I know about a dozen people that have already tried it out in the week or so since I made the paper available, which is pretty cool. But citations will tell.)

What use is replication?

Way back when, Jim Graham politely schooled me in the true meaning of reproducibility, as opposed to replication. He was about 2/3 right, but then he went a bit too far and said

"But let's drop the idea that I'm going to take your data and your code and "reproduce" your result. I'm not. First, I've got my own work to do. More importantly, the odds are that nobody will be any wiser when I'm done."

Well, let's take a look at that concern, shall we?

With the benefit of about two years of further practice, I can tell you this is a dangerously wrong way to think, at least in the field of bioinformatics. My objections hinge on a few points:

First, based on our experiences so far, I'd be surprised if the authors themselves could replicate their own computational results -- too many files and parameters are missing. We call that "bad science".

Second, odds are, the senior professor has little or no detailed understanding of what bioinformatic steps were taken in processing the data, and moreover is uninterested in the details; that's why they're not in the Methods. Why is that a problem? Because the odds are quite good that many biological analyses hinge critically on such points. So the peer reviewers and the community at large need to be able to evaluate them (see this RNA editing kerfuffle for an excellent example of reviewer fail). Yet most bioinformatic pipelines are so poorly described that even with some wild-ass guessing I can't figure out what, roughly speaking, is going on. I certainly couldn't replicate it, and generating specific critiques is quite difficult in that kind of circumstance.

Parenthetically, Graham does refer to the climate sciences' struggles with reproducibility and replication. If only they put the same effort into replication and data archiving as they put into arguing with climate change deniers...

Third, Graham may be guilty of physics chauvinism (just like I'm almost certainly guilty of bioinformatics chauvinism...). Physics and biology are quite different: in physics, you often have a theoretical framework to go by, and results should at least roughly adhere to it or else they are considered guilty until proven innocent. In biology, we usually have no good idea of what we're expecting to see, and often we're looking at a system for the very first time. In that environment, I think it's important to make the underlying computation WAY more solid than you would demand in physics (see RNA editing above).

As Narayan Desai pointed out to me (a point I then put into my PyCon talk (slide 5)), physics and biology are quite different in the way data is generated and analyzed. There are fewer sources of data generation in physics, more of a computational culture, and more theory. Having worked with physicists for much of my scientific life (and having published a number of papers with physicists), I can tell you that replication is certainly a big problem over there, but the consequences don't seem as big -- when you have theory, as in physics, discrepancies between theory and computation are far more noticeable and eventually get worked out. Not so in biology.

Fourth, a renewed emphasis on computational methods (and therefore on replicability of computational results) is a natural part of the transition to Big Data biology. The quality of analysis methods matters A LOT when you are dealing with massive data sets with weak signals and many systematic biases. (I'll write about this more later.)

Fifth, and probably most significant from a practical perspective, Graham misses the point of reuse. In bioinformatics, it behooves us to reuse proven (aka published) tools -- at least we know they worked for someone, at least once, which is not usually the case for newly written software. I don't pretend that it's the responsibility of people to write awesome reusable tools for every paper, but sure as heck I should expect to be able to run them on some combination of hardware and software. Often that's not the case, which means I get to reinvent the wheel (yay...) even when I'm doing the same stupid thing the last five pubs did.

For our paper, khmer and screed should be quite reusable. The analysis pipeline for the paper? It's not that great. But at least you can run it, and potentially steal code from it, too.
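Here's a tiny example of the kind of reuse I mean, using screed to iterate over a read file (the file name is hypothetical):

    import screed

    n_reads = 0
    total_bp = 0
    for record in screed.open('reads.fa'):
        n_reads += 1
        total_bp += len(record.sequence)

    print('%d reads, %d bp total' % (n_reads, total_bp))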

When I was talking to a colleague about the diginorm paper, he said something jokingly: "wow, you're making it way too easy for people!" -- presumably he meant it would be way too easy for people to criticize or otherwise complain about the specific way we're doing things. Then, a day or two later, he said, "hmm, but now that I think of it, no one ever uses the software we publish, and you seem to have had better luck with that..." -- recognizing that if you are barely able to run your own software, perhaps others might find it even more difficult.

Heck, the diginorm paper itself would have been far harder to write without the data sets from the Trinity paper and the Velvet-SC paper. Having those nice, fresh, well-analyzed data sets already at hand was fantastic. Being able to run Trinity and reproduce their results was wonderful.

There's a saying in software engineering: "one of the main people you should be programming for is yourself, in 6 months." That's also true in science -- I'm sure I won't remember the finer details of the diginorm paper analysis in 2 years -- but I can always go look into version control. More importantly, new graduate students can go look and really see what's going on. (And I can use it for teaching, too.) And so can other people working with me. So there's a lot of utility in simply nailing everything down and making it runnable.

Replication is by no means sufficient for good science. But I'll be more impressed by the argument that "replication isn't all that important" when I see lack of replication as the exception rather than the rule. Replication is essential, and good, and useful. I long for the day when it's not interesting, because it's so standard. In the meantime I would argue that it certainly doesn't do any harm to emphasize it.

(Note that I really appreciate Jim Graham's commentary, as I think he is at worst usefully wrong on these points, and substantially correct in many ways. I'm just picking on him because he wrote it all down in one place for me to link to, and chose to use the word 'sic' when reproducing my spelling mistake. Low blow ;)

The future

I don't pretend to have all, or even many, of the answers; I just like to think about what form they might take.

I don't want to argue that this approach is a panacea or a high-quality template for others to use, inside or out of bioinformatics. For one thing, I haven't automated some of the analyses in the paper; it's just too much work for too little benefit at this point. (Trust me, they're easy to reproduce... :). For another, our paper used a fairly small amount of data overall; only a few dozen gigabytes all told. This makes it easy to post the data for others to use later on. Several of our next few papers will involve over a half terabyte of raw data, plus several hundred GB of ancillary and intermediate results; no idea what we'll do for them.

Diginorm is also a somewhat strange bioinformatics paper. We just analyzed other people's data sets (an approach which for some reason isn't in favor in high impact bioinformatics, probably because high impact journal submissions are primarily reviewed by biologists who want to see cool new data that we don't understand, not boring old data that we don't understand). There's no way we can or should argue that biological replicates done in a different lab should replicate the results; that's where reproducibility becomes important.

But I would like it if people considered this approach (or some other approach) to making their analyses replicable. I don't mind people rejecting good approaches because they don't fit; to each their own. But this kind of limited enabling of replication isn't that difficult, frankly, and even if it were, it has plenty of upsides. It's definitely not irrelevant to the practice of science -- I would challenge anyone to try to make that claim in good faith.

--titus

p.s. I think I have to refer to this cancer results not reproducible paper somewhere. Done.

Syndicated 2012-04-02 02:40:28 from Titus Brown

Digital normalization of short-read shotgun data

We just posted a pre-submission paper to arXiv.org:

A single pass approach to reducing sampling variation, removing errors, and scaling de novo assembly of shotgun sequences

Authors: C. Titus Brown, Adina Howe, Qingpeng Zhang, Alexis B. Pyrkosz, and Timothy H. Brom

arXiv link

Paper Web site, with source code and tutorials

Abstract:

Deep shotgun sequencing and analysis of genomes, transcriptomes, amplified single-cell genomes, and metagenomes enable the sensitive investigation of a wide range of biological phenomena. However, it is increasingly difficult to deal with the volume of data emerging from deep short-read sequencers, in part because of random and systematic sampling variation as well as a high sequencing error rate. These challenges have led to the development of entire new classes of short-read mapping tools, as well as new de novo assemblers. Even newer assembly strategies for dealing with transcriptomes, single-cell genomes, and metagenomes have also emerged. Despite these advances, algorithms and compute capacity continue to be challenged by the continued improvements in sequencing technology throughput. We here describe an approach we term digital normalization, a single-pass computational algorithm that discards redundant data and reduces both sampling variation and the number of errors present in deep sequencing data sets. Digital normalization substantially reduces the size of data sets and accordingly decreases the memory and time requirements for de novo sequence assembly, all without significantly impacting the content of the generated contigs. In doing so, it converts high random coverage to low systematic coverage. Digital normalization is an effective and efficient approach to normalizing coverage, removing errors, and reducing data set size for shotgun sequencing data sets. It is particularly useful for reducing the compute requirements for de novo sequence assembly. We demonstrate this for the assembly of microbial genomes, amplified single-cell genomic data, and transcriptomic data. The software is freely available for use and modification.

---

I'll blog more about this stuff over the next few days, but, briefly, this paper presents a single-pass, fixed-memory approach to downsampling sequencing data (yeah, the stuff I've been talking about for a while now). This approach is called "digital normalization", or "diginorm" for short. It eliminates lots of data, evens out coverage, and removes errors from shotgun sequencing data sets. The net effect is an often massive amount of data reduction combined with significantly improved scaling of de novo assembly.

Or, to put it another way, it's a streaming lossy compression algorithm that primarily "loses" errors in sequencing data.

We've implemented it as a preprocessing filter that should work on any data set you want to assemble, with potentially any assembler. It's written in C++, with a Python wrapper, as part of khmer. And, of course, it's freely available for use, re-use, modification, and redistribution under the BSD license, 'cause why the heck not?
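To give a feel for the core loop, here's a toy sketch in Python. It uses an ordinary dict as the k-mer counter, which is emphatically not how khmer does it (khmer uses a fixed-memory counting structure written in C++), but the keep/discard rule -- retain a read only if the median abundance of its k-mers, among the reads kept so far, is still below a coverage cutoff -- captures the essential idea:

    # Toy sketch of the digital normalization loop. An ordinary dict stands
    # in for khmer's fixed-memory counting data structure, so this is an
    # illustration of the idea only, not the real implementation.

    K = 20        # k-mer size
    CUTOFF = 20   # desired coverage

    kmer_counts = {}

    def kmers(seq):
        return [seq[i:i + K] for i in range(len(seq) - K + 1)]

    def median(values):
        values = sorted(values)
        return values[len(values) // 2]

    def keep_read(seq):
        # Keep the read only if its median k-mer abundance, among reads kept
        # so far, is still below the coverage cutoff; kept reads then add
        # their k-mers to the counts.
        if len(seq) < K:
            return False
        counts = [kmer_counts.get(km, 0) for km in kmers(seq)]
        if median(counts) >= CUTOFF:
            return False      # this region is already well covered; discard
        for km in kmers(seq):
            kmer_counts[km] = kmer_counts.get(km, 0) + 1
        return True

    # Stream reads through the filter, retaining only the low-coverage ones.
    reads = ['ACGT' * 10, 'ACGT' * 10, 'TTGCAC' * 7]
    kept = [read for read in reads if keep_read(read)]
    print('kept %d of %d reads' % (len(kept), len(reads)))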

If you want to try it out, we've linked to some tutorials for running microbial genome assemblies with Velvet, as well as eukaryotic transcriptome assemblies with Oases and Trinity, on the paper Web site.

We'll submit this paper to PNAS on Friday; I'm still waiting for some final proofreading.

Together with our other paper, which is currently under revision for resubmission to PNAS, these two papers form the theoretical basis for our attack on soil metagenome assembly. (Our plan is sheer elegance in its simplicity!)

--titus

Syndicated 2012-03-22 01:38:07 from Titus Brown
