23 Aug 2016 joolean

gzochi

gzochi 0.10 is out, and it's a big release, befitting its double-digit minor version, I hope. In particular, it includes an initial version of the meta server, which, as its name suggests, provides services to a cluster of gzochi application servers. In this first iteration, the meta server manages R/W locks on portions of a game's object graph, and feeds temporary copies of the relevant object data to the application server nodes to fuel task execution. In the future, it'll also be responsible for client load balancing and coordinating channel message delivery across the cluster.
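
For a sense of what that exchange involves, here's a purely illustrative sketch in C of the kind of request and response an application server and the meta server might trade for a single object. The names and fields below are hypothetical - this is not gzochi's actual wire protocol:

  /* Illustrative only - not gzochi's actual protocol. An application server
     asks the meta server for access to an object; the reply grants a shared
     or exclusive lock and carries a temporary copy of the object's bytes for
     the requester to cache in its local, memory-backed store. */

  #include <stdbool.h>
  #include <stddef.h>

  typedef enum { LOCK_READ, LOCK_WRITE } lock_type;

  typedef struct
  {
    const char *app;           /* game application name */
    const unsigned char *key;  /* serialized object id */
    size_t key_len;
    lock_type requested_lock;  /* shared for reads, exclusive for writes */
  } object_lock_request;

  typedef struct
  {
    bool granted;
    unsigned char *data;       /* temporary copy of the object data */
    size_t data_len;
  } object_lock_response;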

I've been working on this release since the beginning of the year, although much of that time was spent brooding over the way these systems work in Project Darkstar and factoring out some core components of the gzochid container to make them available to the meta server (including a kind of minimal dependency injection framework built on top of GObject - surprised no one's done that before). In the design I ultimately settled on, the native, memory-backed storage engine plays a key role, acting as a cache for object data received from the meta server, which maintains the canonical stores of game data. And so I also spent a lot of cycles examining the performance and concurrency profile of that system, tweaking lock orderings and tree traversal algorithms.
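
As an aside, the dependency injection idea is simpler than it might sound. Here's a minimal sketch of the concept - a registry mapping a GType to a lazily constructed singleton - which is not the framework actually shipped in gzochi 0.10, just an illustration of the shape of it:

  #include <glib.h>
  #include <glib-object.h>

  /* A toy injector: one instance per GType, created on first request. */

  typedef struct
  {
    GHashTable *instances; /* GType -> GObject* */
  } injector;

  static injector *
  injector_new (void)
  {
    injector *self = g_new0 (injector, 1);
    self->instances = g_hash_table_new_full
      (g_direct_hash, g_direct_equal, NULL, g_object_unref);
    return self;
  }

  /* Return the singleton for `type', constructing it if necessary. A real
     framework would also inspect the type's construct-time properties and
     resolve those recursively; this sketch just uses defaults. */

  static gpointer
  injector_resolve (injector *self, GType type)
  {
    gpointer instance =
      g_hash_table_lookup (self->instances, GSIZE_TO_POINTER (type));

    if (instance == NULL)
      {
        instance = g_object_new (type, NULL);
        g_hash_table_insert
          (self->instances, GSIZE_TO_POINTER (type), instance);
      }

    return instance;
  }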

One of the more surprising things I discovered was that locks in the memory storage engine as of 0.9 were too granular, and that this made the store unusually prone to deadlock. As I mentioned in an earlier post, the architecture of the memory storage engine was based on my reading of the Berkeley DB source code - especially the B+tree implementation. There's a lot of stuff in BDB, though, and it's got plenty of features that I was pretty sure I could ignore while building my own transactional data store. One thing I decided to leave out of my implementation was pages. After all, if I didn't need to transfer this data to or from durable storage, then why bother partitioning it into disk bandwidth-friendly chunks? It turns out, though, that pages serve another valuable purpose: They limit the granularity of locks and force concurrent transactions to serialize their writes when accessing keys within some threshold distance, especially with respect to insertions of new key-value pairs. Take the following example:
  1. Transaction 1 attempts to update key a1 (write lock a1)
  2. Transaction 2 attempts to read key a2 (read lock a2)
  3. Transaction 2 attempts to insert key a3 (write lock a3)
  4. Transaction 1 attempts to update key a2 (write lock a2)

In principle, there's nothing about this lock ordering that should lead to deadlock. The two transactions access key a2 in incompatible ways, but that just means the first transaction might need to wait for the second to complete before it can get a write lock on a2. However, in a B+tree, the interstitial structural nodes - not just the "leaf" key-value pairs - are subject to read/write locking, and structural modifications such as inserts require exclusive access to them. So the sequence of locks above actually means:
  1. Transaction 1 attempts to update key a1 (write lock a1, read lock parent a)
  2. Transaction 2 attempts to read key a2 (read lock a2, read lock parent a)
  3. Transaction 2 attempts to insert key a3 (write lock a3, write lock parent a)
  4. Transaction 1 attempts to update key a2 (write lock a2, read lock parent a)
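
To make that discipline concrete, here's a rough sketch of the per-key locking rules described above, using pthread read/write locks as stand-ins for the store's transactional locks. (In the actual store, a deadlocked transaction gets rolled back and retried from scratch rather than blocking forever the way raw pthread locks would; the node structure and function names here are hypothetical.)

  #include <pthread.h>

  /* Hypothetical node structure: each key-value leaf and each interstitial
     node carries its own read/write lock. */

  typedef struct btree_node
  {
    struct btree_node *parent;
    pthread_rwlock_t lock;
    /* ...key, value, children... */
  } btree_node;

  /* Reading or updating an existing key: a shared lock on the parent (the
     tree's structure isn't changing) plus an exclusive lock on the leaf. */

  static void
  lock_for_update (btree_node *leaf)
  {
    pthread_rwlock_rdlock (&leaf->parent->lock);
    pthread_rwlock_wrlock (&leaf->lock);
  }

  /* Inserting a new key changes the parent's set of children, so the parent
     itself needs an exclusive lock. This is the lock transaction 2 blocks on
     at step 3 above, while transaction 1 - already holding a read lock on the
     same parent from step 1 - blocks at step 4 on transaction 2's read lock
     on key a2. */

  static void
  lock_for_insert (btree_node *parent, btree_node *new_leaf)
  {
    pthread_rwlock_wrlock (&parent->lock);
    pthread_rwlock_wrlock (&new_leaf->lock);
  }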

With this additional bit of context, it becomes clear how the contention between these transactions can lead to deadlock: The first transaction can't make progress until it gets a write lock on key a2, but the second transaction won't release its read lock on a2 until it can get a write lock on interstitial parent node a to add key a3 to it - and it can't get that write lock, because the first transaction is still holding a read lock on that same parent node. This pattern of clustered reads and insertions is quite common for gzochi application tasks, in which some data from the game object graph is read or updated and then another task is durably scheduled for execution. But this is also where pages can help! By grouping regions of an ordered keyspace into "chunks" of keys that have to be locked in bulk, the granularity of interleaved access between concurrent transactions is effectively capped, and transactions attempting to update or insert proximal keys are forced to serialize. Bad for concurrency, but ultimately good for performance in a latency-sensitive environment where retrying a transaction from scratch hurts. Page locking makes the following change to the lock sequence above:
  1. Transaction 1 attempts to update key a1 (write lock page a)
  2. Transaction 2 attempts to read key a2 (read lock page a)
  3. Transaction 2 attempts to insert key a3 (write lock page a)
  4. Transaction 1 attempts to update key a2 (write lock page a)

In this scenario, because the first transaction's very first update requires a write lock on the entire page, the second transaction can never acquire the read lock on a2 that set up the deadlock in the first place. The practical result is that the two transactions execute serially, like so:
  1. Transaction 1 attempts to update key a1 (write lock page a)
  2. Transaction 1 attempts to update key a2 (write lock page a)

[Transaction 1 commit: unlock page a]
  1. Transaction 2 attempts to read key a2 (read lock page a)
  2. Transaction 2 attempts to insert key a3 (write lock page a)
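
In code, the change amounts to locking the page that covers a key rather than the key and its parent node individually. Again, this is just a sketch of the idea rather than the actual gzochi implementation; page_for_key and the page layout are hypothetical:

  #include <pthread.h>
  #include <stddef.h>

  /* A page groups a contiguous run of keys under a single read/write lock. */

  typedef struct
  {
    pthread_rwlock_t lock;
    /* ...some bounded number of key-value pairs... */
  } page;

  /* Find the page whose key range covers `key'; in the B+tree this is a
     descent from the root, elided here. */

  extern page *page_for_key (const unsigned char *key, size_t key_len);

  static void
  lock_for_read (const unsigned char *key, size_t key_len)
  {
    pthread_rwlock_rdlock (&page_for_key (key, key_len)->lock);
  }

  /* Updates and inserts alike take the page's write lock, which is what
     forces the two transactions above to run one after the other instead of
     interleaving their way into a deadlock. */

  static void
  lock_for_write (const unsigned char *key, size_t key_len)
  {
    pthread_rwlock_wrlock (&page_for_key (key, key_len)->lock);
  }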

After making the switch to pages in the memory storage engine, the rate of deadlock for these kinds of transaction workflows dropped sharply, and is now roughly the same as that of the BDB-based storage engine. ...Which stands to reason.

Check out the release! I'm proud of it.
