3 Feb 2007 pesco   » (Journeyer)

Back from the dead, with a nonlinear parser

So, everything went rather differently than planned. What was supposed to be a holiday clean-up rewrite of a fun weekend project has turned into a half-year side project running alongside university.

To recap, I initially set out to implement a Markdown[1] parser in Haskell so I could post formatted text to my Advogato blog. An email-to-Advogato gateway was quickly whipped up[2]. The first prototype of a Markdown parser was also finished within a reasonable time[3]. Unfortunately, the code was a mess, so I set out on the rewrite[4]. Much progress was made, but the parser kept screwing up in certain minor but annoying cases and the code still looked convoluted. Basically, Parsec just didn't want to bend in the right direction...

So I replaced Parsec. The module is called Text.ParserCombinators.Nonlinear[5] because it allows one to slurp in parts of the document in one part of the parser and reparse them later. This allowed me to split up the document according to its block-level structure and re-assemble, for instance, the text pieces of quoted or indented lines (without the leading quote marks/indentation) and run the corresponding parser over the thus-extracted subdocument. Such embedded parses can also work with a completely different token type than the enclosing parser, a capability which also came in handy.
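
To make the "slurp now, reparse later" idea concrete, here is a minimal sketch. It is not the actual Text.ParserCombinators.Nonlinear API: every name in it (Parser, slurpWhile, embed, blocks, and so on) is made up for illustration, and it only handles plain paragraphs and "> " quotes. The outer parser uses whole lines as tokens; a quote is parsed by collecting its lines with the quote marks stripped and running the block parser again over that subdocument, while paragraphs hand their text to an inner parser that works on Char tokens.

```haskell
import Control.Monad (ap, liftM)
import Data.Char (isSpace)

-- Hypothetical toy parser over a list of tokens of type t.
newtype Parser t a = Parser { runParser :: [t] -> Maybe (a, [t]) }

instance Functor (Parser t) where
  fmap = liftM

instance Applicative (Parser t) where
  pure a = Parser $ \ts -> Just (a, ts)
  (<*>)  = ap

instance Monad (Parser t) where
  Parser p >>= f = Parser $ \ts -> case p ts of
    Nothing     -> Nothing
    Just (a, r) -> runParser (f a) r

failP :: Parser t a
failP = Parser (const Nothing)

-- Try the first parser; if it fails, run the second on the same input.
orElse :: Parser t a -> Parser t a -> Parser t a
orElse (Parser p) (Parser q) = Parser $ \ts -> maybe (q ts) Just (p ts)

-- Consume the longest run of leading tokens accepted by f, keeping f's results.
slurpWhile :: (t -> Maybe b) -> Parser t [b]
slurpWhile f = Parser (Just . go)
  where
    go (t : ts) | Just b <- f t = let (bs, rest) = go ts in (b : bs, rest)
    go ts                       = ([], ts)

nonEmpty :: Parser t [b] -> Parser t [b]
nonEmpty p = p >>= \xs -> if null xs then failP else pure xs

-- Run an inner parser over previously extracted tokens (possibly of another
-- token type). The inner parse must consume its whole input; the outer
-- input is left untouched.
embed :: Parser s a -> [s] -> Parser t a
embed inner sub = Parser $ \ts -> case runParser inner sub of
  Just (a, []) -> Just (a, ts)
  _            -> Nothing

data Block = Para [String] | Quote [Block]
  deriving (Eq, Show)

-- Strip one level of quoting from a line, if present.
unquote :: String -> Maybe String
unquote ('>' : ' ' : rest) = Just rest
unquote ">"                = Just ""
unquote _                  = Nothing

-- A paragraph line is neither blank nor quoted.
plain :: String -> Maybe String
plain l
  | all isSpace l       = Nothing
  | Just _ <- unquote l = Nothing
  | otherwise           = Just l

-- Stand-in for an inline-level parser; note the Char token type.
wordsP :: Parser Char [String]
wordsP = Parser $ \cs -> Just (words cs, [])

quote :: Parser String Block
quote = do
  inner <- nonEmpty (slurpWhile unquote) -- quoted lines, marks removed
  Quote <$> embed blocks inner           -- reparse them as a subdocument

para :: Parser String Block
para = do
  ls <- nonEmpty (slurpWhile plain)
  Para <$> embed wordsP (unwords ls)     -- inner parse over Char tokens

blocks :: Parser String [Block]
blocks = do
  _ <- slurpWhile blank
  more `orElse` pure []
  where
    more    = (:) <$> (quote `orElse` para) <*> blocks
    blank l = if all isSpace l then Just () else Nothing

parseDoc :: String -> Maybe [Block]
parseDoc src = fst <$> runParser blocks (lines src)
```

With these toy definitions, parseDoc "intro\n\n> quoted\n> > nested\n" should yield a Para, followed by a Quote containing a Para and a nested Quote. The point of the design is that the quote parser never has to know about paragraph or inline syntax: it only extracts lines and delegates.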

I recently came across "Frisby"[6], a Haskell implementation of parsing expression grammars (PEGs), which I had never heard of before. The description sounds cool. I wonder if my Markdown variant could be represented by one? My parser library is optimized for neither space nor speed, and PEGs sound compelling in that regard...

Anyway, the implementation based on my nonlinear parsers worked out really nicely with respect to code structure and doesn't show any of the kinks that plagued the Parsec version. Since I've deviated somewhat from Markdown syntax in the places I didn't like, I've dubbed the package k-tex. I've still got to update the documentation, but if anyone is interested in looking at or even improving the code, you can find it at http://www.khjk.org/~sm/code/k-tex/.

Best regards, Sven Moritz

PS. Yep, the Advogato gateway[7] already uses k-tex, and if this post appears on my blog[8], it's working. ;)

References:

[2]
[3]
[4] "Structural plain-text, next iteration", http://www.advogato.org/person/pesco/diary.html?start=17
[5]
[8]
