16 Mar 2011 apenwarr   » (Master)

Parsing ought to be easier

I just realized that I've spent too much time writing about stuff I actually know about lately. (Although okay, that last article was a stretch.) So here's a change of pace.

I just read an interesting article about parsing, that is, Parsing: The Solved Problem that Isn't. It talks about "composability" of grammars, that is, what it would take to embed literal SQL into your C parser, for example.

It's a very interesting question that I hadn't thought of before. Interesting, because every parser I've seen would be *hellish* to try to compose with another grammar. Take the syntax highlighting in your favourite editor and try to have to it auto-shift from one language to another (PHP vs. HTML, or python vs. SQL, or perl vs. regex). It never works. Or if you're feeling really brave, take the C++ parser from gcc and use it to do syntax highlighting in wordpress for when you insert some example code. Not likely, right?

The article was a nice reminder of what I had previously learned about parsing in school: context free grammars, LL(k), and so on. Before I went to school, I had never heard of or used any of those things, but I had written a lot of parsers; I now know that what I had independently "invented" is called recursive descent and it seems to be pretty unfashionable among people who "know" parsing.

I admit it; I'm a troublemaker and I fail badly at academia. I still always write all my parsers as recursive descent. Sometimes I even don't split the tokenizer from the grammar. Booyah! I even write non-conforming XML parsers sometimes, and use them for real work.

So if you're a parsing geek, you'd better leave now, because this isn't going to get any prettier.

Anyway, here's my big uneducated non-academic parsing neophyte theory:

You guys spend *way* too much time caring about arithmetic precedence.

See, arithmetic precedence is important; languages that don't understand it (like Lisp) will never be popular, because they prevent you from writing what you mean in a way that looks like what you mean. Fine. You've gotta have it. But it's a problem, because context-free grammars (and its subsets) have a *seriously hard time* with it. You can't just say "addition looks like E+E" and "multiplication looks like E*E" and "an expression E is either a number or an addition or a multiplication", because then 1+2*3 might mean either (1+2)*3 or 1+(2*3), and those are two different things. Every generic parsing algorithm seems to require hackery to deal with things like that. Even my baby, recursive descent, has a problem with it.

So here's what I say: just let it be a hack!

Because precedence is only a tiny part of your language, and the *rest* is not really a problem at all.

When I write a parser that cares about arithmetic precedence - which I do, sometimes - the logic goes like this:

  • ah, there's the number one
  • a plus sign!
  • the number two! Sweet! That's 1+2! It's an expression!
  • a multiplication sign. Uh oh.
  • the number three. Hmm. Well, now we have a problem.
  • (hacky hacky swizzle swizzle) Ta da! 1+(2*3).
I'm not proud of it, but it happens. You know what else? Almost universally, the *rest* of the parser, outside that one little hacky/swizzly part, is fine. The rest is pretty easy. Matching brackets? Backslash escapes? Strings? Function calls? Code blocks? All those are easy and non-ambiguous. You just walk forward in the text one token at a time and arrange your nice tree.

The dirty secret about parsing theory is that if you're a purist, it's almost impossible, but if you're willing to put up with a bit of hackiness in one or two places, it's *really* easy. And now that computers are fast, your algorithm rarely has to be even close to optimized.

Even language composition is pretty easy, but only in realistic cases, not generalized ones. If you expect this to parse:

	if (parseit) {
		query = "select "booga" + col1 from table where n="}"";
	}

Then you've got a problem. Interestingly, a human can do it. A computer *could* do it too. You can probably come up with an insane grammar that will make that work, if you want to allow for potentially exponential amounts of backtracking and no chance of separating scanning from parsing. (My own preferred recursive descent technique is almost completely doomed here, so you'll need to pull out the really big Ph.D. parsing cannons.) But it *is* possible. You know it is, because you can look at the above code and know what it means.

So that's an example of the "hard problems" that you're talking about when you try to define composability of independent context-free grammars that weren't designed for each other. It's a virtually impossible problem. An interesting one, but not even one that's guaranteed to have a generalizable solution. Compare it, however, with this:

	if (parseit) {
		query = { select "booga" + col1 from table where n = "}" };
	}

Almost the same thing. But this time, the SQL is embedded inside braces instead of quotes. Aha, but that n="}" business is going to screw us over, right? The C-style parser will see the close-brace and stop parsing!

No, not at all. A simple recursive descent parser, without even needing lookahead, will have no trouble with this one, because it will clearly know it's inside a string at the time it sees the closebrace. Obviously you need to be using a SQL-style tokenizer inside the SQL section, and your SQL-style tokenizer needs to somehow know that when it finds the mismatched brace, that's the end of its job, time to give it back to C. So yeah, if you're writing this parser "Avery style", you're going to have to be writing it as one big ugly chunk of C + SQL parser all in one. But it won't be any harder than any other recursive descent parser, really; just twice the size because it's for two languages instead of one.

So here's my dream: let's ignore the parsing purists for a moment. Let's accept that operator precedence is a special case, and just hack around it when needed. And let's only use composability rules that are easy instead of hard - like using braces instead of quotes when introducing sublanguages.

Can we define a language for grammars - a parser generator - that easily lets us do *that*? And just drop into a lower-level implementation for that inevitable operator precedence hack.

Pre-emptive snarky comments: This article sucks. It doesn't actually solve any problems of parsing or even present a design, it just complains a lot. Also, this type of incorrect and dead-end thinking is already well covered in the literature, it's just that I'm too lazy to send you a link to the article right now because I'm secretly a troll and would rather put you down than be constructive. Also, the author smells funny.

Response to pre-emptive snarky comments: All true. I would appreciate those links though, if you have them, and I promise not to call you a troll. To your face.

Syndicated 2011-03-16 02:13:11 from apenwarr - Business is Programming

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!