21 Feb 2008 (updated 21 Feb 2008 at 08:38 UTC)
Data Management for RDF
I was talking to a database researcher recently about why
the artificial intelligence community and the database
community haven't historically seen eye-to-eye. The
researcher's opinion was that AI folks tend to regard
databases as hopelessly limited in their expressive power,
whereas DB folks tend to view AI data models as hopelessly
difficult to implement efficiently. There is probably some
truth to both views.
I was reminded of this when doing some reading about data
management techniques for RDF
(the proposed data model for the Semantic Web). Abadi et
al.'s "Scalable
Semantic Web Data Management Using Vertical
Partitioning" is a nice paper from VLDB 2007, and
appears to be one of
a relatively small group of papers that approach the
Semantic Web from a database systems perspective. The paper
proposes a new model for storing RDF data, which essentially
applies the column-store ideas from the C-Store
and Vertica
projects. Sam Madden and Daniel Abadi talk about their ideas
more in a blog
entry at The Database Column.
Planet PostgreSQL readers might be interested in this
observation in the paper:
We chose Postgres as the row-store to experiment with
because Beckmann
et al. experimentally showed that it was by
far more efficient dealing with sparse data than commercial
database products. Postgres does not waste space storing
NULL data: every tuple is preceded by a bit-string of
cardinality equal to the number of attributes, with '1's at
positions of the non-NULL values in the tuple. NULL data is
thus not stored; this is unlike commercial products that
waste space on NULL data. Beckmann et al. show that Postgres
queries over sparse data operate about eight times faster
than commercial systems.
(A minor nitpick: Postgres will omit the per-tuple NULL
bitmap when none of the attributes of a tuple are NULL, so
it is not quite true that "every tuple is preceded by
a bit-string".)
The cited Beckmann et al. paper is "Extending
RDBMSs To Support Sparse Datasets Using An Interpreted
Attribute Storage Format".
It's interesting that none of the leading commercial systems
seem to use exactly the same NULL bitmap approach that
Postgres does. The tradeoff appears to be storage against
computation time: eliding NULL values from the on-disk
tuple reduces storage requirements, but makes it more
expensive to find
the offset within a tuple at which an attribute begins, if
the attribute is preceded by one or more (elided) NULL
values. If NULL values
were stored in the on-disk tuple (and no variable-width
attributes were used), the offset of an attribute could be
found more efficiently.
In practice, Postgres implements another optimization that
mitigates this problem to some extent: as tuples are passed
around the executor and attributes are "extracted" from the
on-disk tuple representation, they are effectively cached
using the TupleTableSlot mechanism. This means that
the computation to find the right offset for an attribute in
the presence of NULLs is typically done at most once
per attribute of a tuple.