11 Dec 2012 apenwarr   » (Master)

Let's think about *small* data for a change

There's a lot of buzz lately about "big data" - huge, Internet scale databases that take a long time to query, but are awesome, I guess, because of how much testosterone you need to have in order to possess one.

Let's leave aside the question of exactly how big big data needs to be. I've heard of people talking about databases as small as a gigabyte as being "big," I guess because downloading it would take a few minutes, and maybe 'grep' isn't the best way to query it. Other people would say a terabyte is big, or a petabyte.

I don't really care. All I know is, beyond a certain threshold that depends on the current state of computer technology, as soon as your data is "big," queries are slow. Maybe a few seconds, which isn't so bad, but maybe a few *minutes*. And when things get slow, you start having to mess with having separate "data warehouse" servers so that doing your big analytical queries don't bring down your whole database. And managing it all becomes a full time job for someone or many people or a whole company.

I happen to work for an employer that does that sort of thing a lot. And to be honest, I find it pretty boring. Perhaps I have a testosterone deficiency. It's not so much the bigness that bothers me: it's the waiting. I like my compile-test-debug cycle to be on the order of two seconds, but when SQL or mapreduce gets involved, it's more like two minutes.1

I know, cry for me, right? Two minutes of my life, all gone. But seriously, when you're trying to find trends and aren't quite sure what you're looking for and it takes a dozen tries, those two minutes can add up rapidly, especially when added to the 10 minutes of web browsing or email that inevitably ensues once I get bored waiting for the two minutes.

After worrying about this problem for a long time (years now, I guess), I think I've come up with a decent workaround. The trick is to divide your queries into multiple stages. At each stage, we reduce the total amount of data by a few orders of magnitude, and thus greatly decrease the cost of debugging a complex query.

Originally, I might have tried to make a single SQL query that goes from, say, a terabyte of data, down to 10 rows, and made a bar chart. Which 10 rows? Well, it takes a few tries to figure that out, or maybe a few hundred tries. But it's much faster if we set the initial goal instead to, say, only pull out the gigabyte I need from that terabyte. And then I can query a gigabyte instead. From there, I can reduce it to a megabyte, perhaps, which is easy enough to process in RAM without any kind of index or complexity or optimization.

That last part is what I want to talk to you about, because I've been working on a tool to do it. I call it afterquery, tagline: "The real fun starts after the Serious Analysts have gone home." Serious people write mapreduces and SQL queries. Serious people hire statisticians. Serious people have so much data that asking questions about the data requires a coffee break, but they get paid so much they don't have to care.

Afterquery is the opposite of Serious. It downloads the whole dataset into RAM on your web browser and processes it in javascript.

But what it lacks in seriousness, it makes up for in quick turnaround. Here's what it produces from 1582 rows of data I got from Statistics Canada:2