Advogato: Blog for joolean

Transactional B-trees

The next version of gzochi is going to include a new storage engine implementation that keeps data entirely in memory while providing transactional guarantees around atomicity of operations and visibility of data. There are a couple of motivations for this feature. The primary reason is to prepare the architecture for multi-node operation, which, as per Blackman and Waldo's technical report on Project Darkstar, requires that the game server -- which becomes, in a distributed configuration, more like a task execution server -- maintain a transient cache of game data delivered from a remote, durable store of record. The other is to offer an easier mechanism for "quick-starting" a new gzochi installation, and to support users who, for political or operational reasons, prefer not to bundle or install any of the third-party databases that support the persistent storage engines.

That first motivation wouldn't bias me much in either direction on the build-vs-buy question; Project Darkstar's authors almost certainly (planned) to implement this using Berkeley DB's "private" mode (example here). However, gzochi is intentionally agnostic when it comes to storage technology. The database that underlies a storage engine implementation needs only to support serializably isolated cursors and reasonable guarantees around durability; requiring purely in-memory operation would be a heavy requirement. And I feel too ambivalent about Oracle to pin the architecture to what BDB supports, AGPL or no. (The Darkstar architects should have been a bit warier themselves.) So I settled on the "build" side of the balance. ...Although my first move was to look for some source code to steal. And I came up weirdly short. The following is a list of the interesting bits and dead ends I came across while searching for transaction B-tree implementations.

Some more specific requirements: There are two popular flavors of concurrency for the kind of data structure I wanted to build with the serializable level of transactional isolation I wanted to provide. Pessimistic locking requires that all access to the structural or data content of the tree by different agents (threads, transactions) be done under the protection of an explicit read or write lock, depending on the nature of the access. Optimistic locking often comes in the form of multi-version concurrency control, and offers each agent a mutable snapshot of the data over the lifetime of a transaction, mostly brokering resolutions to conflicts only at commit time. Each approach has its advantages: MVCC transactions never wait for locks, which usually makes them faster. Pessimistic locking implementations typically detect conflicting access patterns earlier than optimistic implementations, which wait until commit to do so. Because gzochi transactions are short and fine-grained, and the user experience is sensitive to latency, I believe that the time wasted by unnecessary execution of "doomed" transactional code is better avoided via pessimistic locking. (I could be wrong.)

Here's what I found:

Apache Mavibot - Transactional B-tree implementation in Java using MVCC. Supports persistent and in-memory storage. Hard for me to tell from reading their source code how their in-memory mode could possibly work for multi-operation transactions.
Masstree - Optimistically concurrent, in-memory non-transactional B+tree implementation designed to better exploit SMP hardware.
Silo - Optimistically concurrent, in-memory transactional store that uses Masstree as its B-tree implementation.
SQLite - Lightweight SQL implementation with in-memory options, with a transaction-capable B-tree as the underlying storage. Their source code is readable, but the model of threads, connections, transactions, and what they call "shared cache" is hard to puzzle out. The B-tree accumulates cruft without explicit vacuuming. The B-tree code is enormous.
eXtremeDB - Commercial in-memory database with lots of interesting properties (pessimistic and MVCC modes, claimed latencies extremely low) but, you know, no source code. So.

Because I was unable to find any pessimistic implementations with readily stealable source code, I struck out on my own. It took me about a week to build my own pessimistic B+tree, using Berkeley DB's code and architecture as a surprisingly helpful guide. My version is significantly slower than BDB (with persistence to disk enabled) but I'm going to tune it and tweak it and hopefully get it to holler if not scream.

16 Feb 2015 joolean » (Journeyer)