
Older blog entries for nconway (starting at number 47)

Filesystem Replication and Software Quality

A few years ago, I did a summer internship with a group at Microsoft that was building a multi-master filesystem replication product. It was a very rewarding experience. Now that the product has shipped (in Windows Server 2003 R2, Vista, and Windows Live Messenger), I was happy to see that my mentor for that summer, Nikolaj Bjørner, has published a paper containing "lessons learned" from the project: "Models and Software Model Checking of a Distributed File Replication System". The paper is worth reading for a few reasons:

  1. Why is filesystem replication such a hard problem, particularly in the asynchronous, multi-master case?
    The paper talks about the basic problem and the approach the group took to solving it.
  2. Perhaps more interestingly, how do you go about constructing a high-quality implementation of such a product?
    I was impressed by the group's emphasis on correctness. Nikolaj and Dan (the technical lead for the group) both had a CS theory background, so this is perhaps not surprising -- but it's interesting to see some of the practical techniques that they used to ensure they built a correct replication system:
    • A detailed specification (on the order of a few hundred pages)
    • A prototype of the system in OCaml, written concurrently with the specification but before the real implementation work began
    • A high-level, executable specification of the replication protocol in AsmL. This served as both a readable description of the protocol, as well as a way to automatically generate useful test cases.
    • Using model checking to verify the correctness of certain particularly complex aspects of the protocol (distributed garbage collection, conflict resolution).
    • A "simulator" that walked a random tree of filesystem operations, pausing after each node to verify that the system had correctly replicated the resulting filesystem state. Once a leaf node in the tree was reached, the simulator then backtracked, exploring another branch of the tree. The simulator was also clearly inspired by model checking techniques. By replacing certain components of the real system with virtualized ones (e.g. using a toy in-memory database), this tool could be used to test large numbers of scenarios very quickly.
    • Exhaustive testing. Using the simulator and a cluster of test machines, more than 500 billion test cases were examined.
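That simulator idea is worth making concrete. Here is a minimal sketch of such a backtracking simulator, entirely my own toy reconstruction rather than the group's actual tool: it walks every sequence of operations up to a fixed depth, applies each operation to both the "system under test" and a reference model (here both sides are just sets of file names; in the real tool the system side was the actual replication engine), and verifies that the two states agree after every step.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class ReplicationSimulator {
    // A toy "filesystem operation": create or delete a file name.
    record Op(String kind, String name) {}

    static final List<Op> OPS = List.of(
            new Op("create", "a"), new Op("create", "b"),
            new Op("delete", "a"), new Op("delete", "b"));

    static long scenarios = 0;

    static void apply(Set<String> fs, Op op) {
        if (op.kind().equals("create")) fs.add(op.name());
        else fs.remove(op.name());
    }

    // Depth-first walk of the tree of operation sequences. After each step we
    // check that the system under test agrees with the reference model; at a
    // leaf we count the scenario and backtrack (state is copied, not undone).
    static void explore(Set<String> system, Set<String> model, int depth) {
        if (depth == 0) {
            scenarios++;
            return;
        }
        for (Op op : OPS) {
            Set<String> sys = new TreeSet<>(system);
            Set<String> mod = new TreeSet<>(model);
            apply(sys, op);
            apply(mod, op);
            if (!sys.equals(mod))
                throw new AssertionError("states diverged after " + op);
            explore(sys, mod, depth - 1);
        }
    }

    public static void main(String[] args) {
        explore(new TreeSet<>(), new TreeSet<>(), 3);
        System.out.println("explored " + scenarios + " scenarios"); // 4^3 = 64
    }
}
```

With four operations and a depth of 3, the walk visits 4³ = 64 leaf scenarios; replacing heavyweight components with virtualized ones (like a toy in-memory database) is what made it feasible to push this style of testing into the hundreds of billions of scenarios.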

Jim Gray Tribute

On May 31, 2008, a tribute to honor the life and work of Jim Gray will be held at UC Berkeley. There's a technical session, for which registration is required, preceded by a general session that is open to the public. As the invitation email I received (thanks Elein!) states:

This is not a memorial, because Jim is still listed as missing, and will be so listed until about Jan 28, 2011. It is important that it is not referred to as a memorial, because it can't be a memorial until then. We believe that it is good to go ahead and recognize Jim's contributions, to honor him in a Tribute, before such a long time has passed.

9 Nov 2007 (updated 9 Nov 2007 at 07:42 UTC)
Stonebraker on Databases for "Big Science"

There's a new post by Stonebraker up at The Database Column. I don't have much to add to the post itself, although it's interesting to hear some information about the old Sequoia 2000 project. I notice that Google and Yahoo were invited to the workshop — at first glance, it seems to me that the data management problems faced by the big web companies are quite dissimilar to the challenges facing "big science", but perhaps that's not the case.

22 Oct 2007 (updated 22 Oct 2007 at 20:28 UTC)
PostgreSQL Conference Fall 2007

This weekend's conference in Portland was a great experience. Many thanks to Selena Deckelmann, Josh Drake, and all the other volunteers for organizing and running the conference. Everything ran amazingly smoothly!

I've posted the slides to my talk on "Query Execution Techniques in PostgreSQL". I thought the talk went fairly well, although unfortunately I didn't have enough time to get to everything I wanted to discuss.

In the talk, one of the algorithms I discussed was the "hybrid hash join", which is the common hash join algorithm used by most modern DBMSs, including PostgreSQL. The night before, Jeff Davis tipped me off to the fact that the inventor of the hybrid hash join algorithm, Dr. Len Shapiro from PSU, was going to be in the audience! Thankfully I didn't get the details of the hybrid hash join wrong :) It was a pleasure to meet Dr. Shapiro, whose students are doing some interesting work improving hash index bulk build performance.
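For those who weren't at the talk, the gist of the algorithm can be sketched in a few lines of Java. This is my own toy, fully in-memory version, in which the "spilled" partitions are just lists; a real executor such as PostgreSQL's writes non-resident batches to temporary files and handles memory accounting, skew, and much more.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HybridHashJoin {
    static final int PARTITIONS = 4;

    static int partition(int key) {
        return Math.floorMod(Integer.hashCode(key), PARTITIONS);
    }

    // Joins two bags of integer keys, returning (buildKey, probeKey) pairs.
    static List<int[]> join(int[] build, int[] probe) {
        List<int[]> result = new ArrayList<>();

        // Phase 1: partition the build side. Partition 0 goes straight into
        // an in-memory hash table; the rest stand in for spilled batches.
        Map<Integer, List<Integer>> table = new HashMap<>();
        List<List<Integer>> buildBatches = new ArrayList<>();
        for (int i = 0; i < PARTITIONS; i++) buildBatches.add(new ArrayList<>());
        for (int key : build) {
            int p = partition(key);
            if (p == 0) table.computeIfAbsent(key, k -> new ArrayList<>()).add(key);
            else buildBatches.get(p).add(key);
        }

        // Phase 2: partition the probe side. Partition-0 tuples are joined
        // immediately against the resident table; the rest are batched.
        List<List<Integer>> probeBatches = new ArrayList<>();
        for (int i = 0; i < PARTITIONS; i++) probeBatches.add(new ArrayList<>());
        for (int key : probe) {
            int p = partition(key);
            if (p == 0) {
                for (int match : table.getOrDefault(key, List.of()))
                    result.add(new int[]{match, key});
            } else {
                probeBatches.get(p).add(key);
            }
        }

        // Phase 3: join each remaining batch pair, one partition at a time.
        for (int p = 1; p < PARTITIONS; p++) {
            Map<Integer, List<Integer>> t = new HashMap<>();
            for (int key : buildBatches.get(p))
                t.computeIfAbsent(key, k -> new ArrayList<>()).add(key);
            for (int key : probeBatches.get(p))
                for (int match : t.getOrDefault(key, List.of()))
                    result.add(new int[]{match, key});
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> out = join(new int[]{1, 2, 3, 4}, new int[]{2, 4, 4, 5});
        System.out.println(out.size()); // 1 match on key 2, 2 on key 4: prints 3
    }
}
```

The "hybrid" part is that partition 0 never leaves memory: its build tuples go straight into the hash table and its probe tuples are joined on first contact, saving a write and a re-read pass for that fraction of the data.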

Note To Self

When running Postgres in EXEC_BACKEND mode on Linux, make sure to do:

echo 0 > /proc/sys/kernel/randomize_va_space

Preferably before spending a while wondering about the non-deterministic process-startup errors that will otherwise occur.

PgCon

I gave a revised version of the "Introduction to Hacking PostgreSQL" tutorial at PgCon earlier today. I've posted the slides, handouts, and example patch here; this version uses a completely new example patch, and much of the introductory material has been revised. You can also find the slides at the PgCon page for the talk, which also includes a link to give me feedback.

23 Apr 2007 (updated 23 Apr 2007 at 21:43 UTC)
School

I've been busy at school recently, finishing my undergraduate thesis and taking my final exams. I've got one more exam on Thursday, but once that's done, I should be pretty much finished with my undergraduate degree (fingers crossed).

In a complexity theory class I'm taking, I recently gave a talk on Kolmogorov complexity, introducing the basic ideas and discussing some notable applications. I'm no expert in algorithmic information theory (AIT), so take the contents with a grain of salt.

Work

I've decided to take a full-time position at Amalgamated Insight, working on their PostgreSQL-based data stream management system (DSMS). I interned at AI last summer and I was very impressed, so I'm excited at the chance to work for them again. This also means I'll be moving to the Bay Area in early June.

Speaking of stream processing, I'll be giving a talk about data stream query processing at PgCon 2007. Gavin and I are also doing an introduction to modifying the PostgreSQL source code, similar in spirit to the "Introduction to Hacking PostgreSQL" tutorial we gave at the Anniversary Summit. The details are still a little hazy, but the goal is to make the session rewarding both to newcomers and to those who attended the previous tutorial.

Complexity and Life

The Kolmogorov complexity of a string x is the length of the shortest computer program that outputs x, relative to a given model of computation. Complexity is therefore inversely related to the amount of structure, or redundant information, in the string: as a string's complexity approaches its length, the string becomes increasingly random. (A string's complexity can exceed its length by at most an additive constant, since we can always produce the string with a program that simply prints it verbatim.)

I'd like to write more about Kolmogorov complexity later, but for now I'll just point to an interesting application of this idea to the study of biological life. Zhanna Reznikova has applied information theory to the study of communication between animals, in particular ants. This paper describes an interesting result: an experiment was set up in which a "scout" ant communicates the location of some food to a group of "forager" ants. The path from the ants to the food leads through a binary maze, so the path can be represented as a sequence of left and right turns (e.g. "RRR", "RLR", etc.). Reznikova measured the time it took the scout to transmit the location of the food to the other ants, then varied the path to see how the transmission time depended on the complexity of the route. As it turns out (p. 9, table 2), the ants communicated paths encodable as low-complexity strings more efficiently. For example, the path "RRRRRR" required a mean duration of 88 seconds to transmit, whereas "RLRLRL" took 135 seconds.
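Kolmogorov complexity itself is uncomputable, but the length of a compressed encoding gives a computable upper bound that makes a serviceable proxy: the more structure a string has, the better it compresses. Here's a small Java sketch of my own (not from Reznikova's paper) comparing a few turn sequences:

```java
import java.util.Random;
import java.util.zip.Deflater;

public class PathComplexity {
    // Compressed length as a stand-in for Kolmogorov complexity: regular
    // strings deflate to a few bytes, near-random strings do not.
    static int compressedLength(String s) {
        Deflater d = new Deflater(Deflater.BEST_COMPRESSION);
        d.setInput(s.getBytes());
        d.finish();
        byte[] buf = new byte[4096];
        int n = 0;
        while (!d.finished()) n += d.deflate(buf);
        d.end();
        return n;
    }

    public static void main(String[] args) {
        String regular = "R".repeat(120);       // a single repeated turn
        String alternating = "RL".repeat(60);   // a period-two pattern
        StringBuilder sb = new StringBuilder(); // pseudorandom turns
        Random rng = new Random(42);
        for (int i = 0; i < 120; i++) sb.append(rng.nextBoolean() ? 'R' : 'L');
        String irregular = sb.toString();

        // Both regular paths compress far better than the irregular one.
        System.out.println(compressedLength(regular));
        System.out.println(compressedLength(alternating));
        System.out.println(compressedLength(irregular));
    }
}
```

The same intuition underlies the ant result: "RRRRRR" admits a shorter description ("six rights") than an irregular path, and the transmission times track that difference.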

13 Mar 2007 (updated 13 Mar 2007 at 05:21 UTC)
Beautiful Concurrency

Simon Peyton-Jones and Simon Marlow's work on software transactional memory offers a very interesting alternative to traditional lock-based or lock-free approaches to implementing concurrent programs. Their paper "Composable Memory Transactions" is a fascinating read: it isn't often that you see research that is simultaneously so practical and so beautiful.

I had read, and been impressed by, the STM paper a year or so ago. So I was happy to notice recently that:

  1. O'Reilly is publishing a book of essays by prominent software developers called Beautiful Code -- about, well, beautiful code. That should almost certainly be worth reading.
  2. Simon PJ's chapter in that book, "Beautiful Concurrency", is available online, and offers a more introductory-level treatment of the ideas behind STM.

Hat tip: Tim Bray, LtU

9 Mar 2007 (updated 9 Mar 2007 at 05:31 UTC)

"Hello, World Considered Harmful" Considered Harmful

Ralph Westfall's "Hello, World Considered Harmful" (CACM, 2001) begins by making a sensible point: the classic K&R example program "Hello, world" is not a very useful example for an introductory object-oriented programming course in Java:

public class HelloWorld
{
    public static void main(String[] args)
    {
        System.out.println("hello, world");
    }
}

Fair enough: "Hello, world" is fundamentally procedural, and doesn't teach students anything about how object-oriented programming works (it might also teach them that Java is a bondage-and-discipline language that forces them to endure plenty of unnecessarily verbose syntax, but they are going to learn that eventually regardless).

Unfortunately, Westfall's solution to the problem falls short of the mark:

class HelloWorld
{
    public static void printHello()
    {
        System.out.println("hello, world");
    }
}
public class UseHello
{
    public static void main(String[] args)
    {
        HelloWorld myHello = new HelloWorld();
        myHello.printHello();
    }
}

Now, we can raise a few minor objections: for example, it is pointless to create an instance of a class that contains only static methods, and it is bad style to invoke a static method on an instance, IMHO.

But the essential point is that Westfall's solution merely applies some irrelevant syntactic sugar. The reason that "hello, world" isn't a good example of object-oriented programming is that it isn't object-oriented in the first place. The intuitive appeal of object-oriented programming stems from the fact that it resembles the "real world", in which objects combine data and behavior. An instance of the HelloWorld class has neither data nor behavior. In fact, the modified example serves mostly to showcase an annoying property of Java: the dependence on static methods as a way of emulating global functions.

So why not apply a little creativity, and use a fresh example that actually illustrates the intuitive appeal of objects? Model a few real-world objects as classes, and give them some behavior: for example, simple geometric shapes like Triangle and Circle, which can then eventually be developed into a lesson on inheritance.
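For instance, a first cut at such a lesson might look like the following sketch (the class and method names here are my own invention, not Westfall's):

```java
// Each shape carries data (its dimensions) and behavior (computing its area).
abstract class Shape {
    abstract double area();
}

class Circle extends Shape {
    private final double radius;

    Circle(double radius) { this.radius = radius; }

    @Override
    double area() { return Math.PI * radius * radius; }
}

class Triangle extends Shape {
    private final double base, height;

    Triangle(double base, double height) {
        this.base = base;
        this.height = height;
    }

    @Override
    double area() { return 0.5 * base * height; }
}

public class Shapes {
    public static void main(String[] args) {
        // Polymorphism for free: each shape computes its own area.
        Shape[] shapes = { new Circle(1.0), new Triangle(3.0, 4.0) };
        for (Shape s : shapes)
            System.out.println(s.getClass().getSimpleName() + " area: " + s.area());
    }
}
```

Unlike HelloWorld, every object here actually has state and behavior, and the Shape base class sets up the later lesson on inheritance and polymorphism.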

(Another improvement would be to use a better teaching language than Java, but that's another story.)

