Advogato: Blog for nconway

grep on steroids

Recently, I've been spending some time poking around in the internals of the PostgreSQL query planner. One of the projects I've been thinking about is fixing a long-standing ugly part of the planner: the fact that it scribbles on its input.

(Context: the planner takes a Query, which is essentially an annotated, rewritten abstract syntax tree describing the query to be executed. It produces a Plan, which describes the particular way to execute the query that the planner believes to be most efficient. The problem is that the planner modifies various parts of the input Query. This is a problem for a few reasons; besides being ugly, (a) it doesn't clearly distinguish between the parser's input and its working state (b) it makes it impossible to pass the same Query to the planner multiple times without copying it each time. I'm a little embarrassed to admit we actually do make extra copies of a Query in a few places in the backend, for precisely this reason.)

So, the first thing I wanted to determine was: in what situations does the planner modify one of the fields of the input Query?

Um, so it seems this isn't a trivial problem to solve. The planner is about 30,000LOC, and the input Query is used all over the place. I also want to find indirect assignments to the input Query — for example, if the planner passes a pointer to a field of the input Query to a function, and the function then procedes to write through the pointer. (In this particular example, I have a pretty good idea where the most egregious offenders are, so I can probably solve the problem well-enough via simple text search using grep, glimpse, and the like — but a general-case solution would be interesting, I think.)

It might be plausible to do this via gdb, valgrind or the like (waiting until a write is actually made to the input Query at runtime and noting the call site then). But this only catches the modifications that happen to be made by a particular invocation of the planner, not all the modifications that might occur. It is also a hack: this information is statically derivable from the source code, so a static analysis seems much nicer than solving the problem at runtime.

Text search that does not incorporate knowledge of the syntax of the source code simply doesn't cut it. One example among many is:

void some_function(SomeType *st) {
    st->field = xxx;
}
Plan *planner(Query *query) {
    some_function(&(query->some_field));
}

Solving the problem in the general case also requires considering aliasing; alias analysis (in C anyway) is a rather complex problem to solve effectively, even in a compiler.

Besides this, there are interesting queries about source code that can't easily be expressed via searching for patterns in text — "show me the functions where function X is called before function Y has been called", "show me the functions that are only invoked by other functions defined in the same file (i.e. these would be candidates for being marked static)", and so on. ISTM a syntax-aware source code search tool like this would be an interesting thing to write (of course, if anyone knows of an F/OSS tool that already does this, let me know).

17 Mar 2005 nconway » (Master)