15 Apr 2008 (updated 15 Apr 2008 at 08:38 UTC)
Kickfire and "Stream Processing"
I noticed Robert's post about the Kickfire launch. He mentioned
Truviso (for whom I work), so I thought I'd add my two cents.
Kickfire is the company previously known as "C2App". I'm not familiar
with the details of their technology, but the basic idea is
to use custom hardware to accelerate data warehousing
queries (this blog post has some more details). Using custom
hardware is not a new idea: Netezza have been doing something
superficially similar for years, with
considerable success. In addition to custom hardware,
Kickfire apparently use a few other data warehousing
techniques that have recently come back in vogue
(e.g. column-wise storage with compression, coupled with the
ability to do query execution over compressed data). As an
aside, I think
that building a data warehousing product using MySQL is a
fairly surprising technical decision.
One thing I did notice is that Kickfire's PR mentions
"stream processing" repeatedly, and Robert's post suggests
that the sort of stream processing done by Kickfire is
similar to what Truviso does. This
is not the case: the two companies and their products are
very different. I'd guess that Kickfire are using the
term because it's become something of a buzzword.
I'd like to talk more about Truviso on this blog in the
future, but the basic idea behind data stream processing is
to allow
analysis queries to be performed over live streams of
data, as the data arrives at the system. In traditional
databases, in order to apply a query to a piece of
data, you first
need to insert the data item into the database, wait for it
to be committed to disk (force-write the write-ahead log),
and then finally
run a query on it from scratch. When data arrives at a rapid
pace and you need low-latency query results, this
"store-and-query" model performs terribly, because every result
pays for a disk write and a fresh query execution; it's also
an unnatural way to structure a client application (you're
essentially polling for results). Instead,
a data stream
query processor allows the user to define a set of
long-running continuous queries that represent the
conditions of interest over the incoming data streams. As
new live data arrives, the data is applied to the queries to
incrementally update their results; client applications can
simply consume new query results as soon as they become
available. This allows you to get
query results that are always up-to-date, without the need
to first
write data to disk (the data can either be discarded, or
else written to disk asynchronously). For certain domains,
such as algorithmic
trading, network and environment monitoring, fraud
detection, and real-time reporting, the data stream approach
often yields much better performance and a more natural
programming model. For more info, see the talk on
data stream query processing I gave at last year's PGCon.
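
To make the contrast with store-and-query concrete, here is a minimal Python sketch of a continuous query. It is only an illustration, not Truviso's actual interface: the `SlidingAverage` class, its `on_arrival` method, and the sample stream are all hypothetical names invented for this example. The query keeps a running average over a time window and updates it incrementally as each item arrives, instead of storing the item and re-running a query from scratch.

```python
from collections import deque


class SlidingAverage:
    """Hypothetical continuous query: average price over the last `window` seconds."""

    def __init__(self, window):
        self.window = window
        self.items = deque()  # (timestamp, price) pairs still inside the window
        self.total = 0.0      # running sum of the prices in `items`

    def on_arrival(self, timestamp, price):
        # Apply the new item to the query's running state.
        self.items.append((timestamp, price))
        self.total += price
        # Expire items that have fallen out of the window.
        while self.items[0][0] <= timestamp - self.window:
            _, old_price = self.items.popleft()
            self.total -= old_price
        # The up-to-date result is available immediately.
        return self.total / len(self.items)


def demo_sliding_average():
    query = SlidingAverage(window=10)
    stream = [(0, 10.0), (4, 20.0), (8, 30.0), (12, 40.0)]
    # One fresh result per arriving item; nothing is written to disk first.
    return [query.on_arrival(ts, price) for ts, price in stream]
```

Each arrival does only the work needed to add the new item and expire old ones, and the client gets a result pushed to it rather than polling. A real stream processor also has to deal with out-of-order data, many concurrent queries, and optional asynchronous archiving, none of which this sketch attempts.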
So what does this have to do with using custom hardware to
accelerate data warehousing queries? Not a whole lot. I'm
guessing that Kickfire have co-opted the "stream processing"
label because they push analysis queries down to the custom
chip, and then "stream" the stored data over the chip, to
compute multiple queries in a single pass. If you squint at
it right, there are some similarities to stream query
processing (in both cases, you only want to take one pass
over the data), but fundamentally, Kickfire is trying to
solve a very different problem, and using a very different
set of technologies. Data warehouse engines like Kickfire
(and Greenplum) are complements to data stream systems like
Truviso (and StreamBase, Coral8, and others), not substitutes or
competitors.
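
As a rough software analogy for the "multiple queries in a single pass" idea guessed at above, here is a small Python sketch. It is purely illustrative and says nothing about how Kickfire's chip actually works: the `shared_scan` function, the (predicate, column) query representation, and the sample rows are all assumptions made up for this example.

```python
def shared_scan(rows, queries):
    """Evaluate several aggregate queries in one pass over stored rows.

    Each query is a hypothetical (predicate, column) pair; its result is
    the sum of `column` over the rows for which `predicate` is true.
    """
    totals = [0] * len(queries)
    for row in rows:  # the stored data is read exactly once
        for i, (predicate, column) in enumerate(queries):
            if predicate(row):
                totals[i] += row[column]
    return totals


def demo_shared_scan():
    rows = [
        {"region": "east", "units": 3, "revenue": 30},
        {"region": "west", "units": 5, "revenue": 50},
        {"region": "east", "units": 2, "revenue": 20},
    ]
    queries = [
        (lambda r: r["region"] == "east", "revenue"),  # revenue in the east
        (lambda r: r["units"] >= 3, "units"),          # units in large orders
    ]
    return shared_scan(rows, queries)
```

The difference from the stream case is where the data lives: here the rows are already stored and complete, and the scan runs to the end before any answer comes back, whereas a continuous query sees the data once as it arrives and emits results as it goes.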