1 Jun 2006 pesco   » (Journeyer)

Representing marked-up text in Haskell

I wrote previously about my plan to use Markdown as the input format for advopost. I decided against re-using the existing Markdown-to-HTML converter, because I would have to strip the resulting output down to the Advogato subset of HTML in postprocessing; feels too clutchy. So I'm going to implement a parser for (a variant of) Markdown that reads the input into a structured Haskell data type 'Doc'. Here is my current design for that type:

   module Doc where
   data Doc     =  Doc         String      -- title
                               [Para]      -- body
                               [Doc]       -- subsections 
   data Para    =  Paragraph   String      -- paragraph title
                               [Block]     -- paragraph body
   data Block   =  Blockquote  Doc
                |  Bulleted    [[Para]]    -- unordered list
                |  Numbered    [[Para]]    -- numbered list
                |  Codeblock   [[Inline]]  -- list of lines, ignore Codespans
                |  Line        [Inline]
   data Inline  =  Str         String      -- ignore linebreaks
                |  Codespan    [Inline]
                |  Emph        [Inline]
                |  Link        [Inline]    -- link text
                               String      -- link target
                               String      -- link title
                |  Image       [Inline]    -- fallback alternative for this image
                               String      -- image location
                               String      -- image title

I want both the input format and the Haskell data structure to be independent of the output format being HTML. Therefore I'm not going to support inline-HTML in the input. I also want structural markup (as opposed to presentational), so I left out horizontal rules and forced linebreaks. Lastly, I've never heard of using a "strong emphasis" (as opposed to normal emphasis) in typesetting, so I dropped that as well.

I've tried to design the above types in such a way as to minimize the possibility of forming non-sensical or ambiguous documents. That's why there is such deep nesting of different types instead of just one big algebraic data type with constructors for concatenation, paragraph and section breaks, etc.. Comments welcome.

I hope that the 'Doc' type will be useful in further coding. For example, it would be really cool to have a fancy combinator library for 'Doc's along with a pretty-printer to turn them back into plaintext: Then we could use them for general pretty output from Haskell programs. While there are several existing pretty-printing libraries, to my knowledge none of them use structural markup and they are all targeted at console output only.

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!