Advogato: Blog for pesco

Representing marked-up text in Haskell

I wrote previously about my plan to use Markdown as the input format for advopost. I decided against reusing the existing Markdown-to-HTML converter, because I would have to strip its output down to the Advogato subset of HTML in postprocessing, which feels too kludgy. So I'm going to implement a parser for (a variant of) Markdown that reads the input into a structured Haskell data type 'Doc'. Here is my current design for that type:

   module Doc where

   data Doc     =  Doc         String      -- title
                               [Para]      -- body
                               [Doc]       -- subsections

   data Para    =  Paragraph   String      -- paragraph title
                               [Block]     -- paragraph body

   data Block   =  Blockquote  Doc
                |  Bulleted    [[Para]]    -- unordered list
                |  Numbered    [[Para]]    -- numbered list
                |  Codeblock   [[Inline]]  -- list of lines, ignore Codespans
                |  Line        [Inline]

   data Inline  =  Str         String      -- ignore linebreaks
                |  Codespan    [Inline]
                |  Emph        [Inline]
                |  Link        [Inline]    -- link text
                               String      -- link target
                               String      -- link title
                |  Image       [Inline]    -- fallback alternative for this image
                               String      -- image location
                               String      -- image title
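
To get a feel for the nesting, here is a rough sketch of what a small document might look like as a value of these types. The types are repeated (comments trimmed) so the snippet compiles on its own. The sample content, the example.org URL, the helper 'countSections', and the convention that an empty title string means "untitled" are all assumptions made for illustration, not part of the design.

```haskell
module Main where

-- Types repeated from the design above (comments trimmed).
data Doc    = Doc String [Para] [Doc]
data Para   = Paragraph String [Block]
data Block  = Blockquote Doc
            | Bulleted [[Para]]
            | Numbered [[Para]]
            | Codeblock [[Inline]]
            | Line [Inline]
data Inline = Str String
            | Codespan [Inline]
            | Emph [Inline]
            | Link [Inline] String String
            | Image [Inline] String String

-- Hypothetical sample: one untitled paragraph plus one subsection.
-- Assumption: an empty title string stands for "no title".
example :: Doc
example =
  Doc "Hello"
      [ Paragraph ""
          [ Line [ Str "See "
                 , Link [Str "this page"] "http://example.org/" "An example"
                 , Str " for details."
                 ]
          , Bulleted
              [ [ Paragraph "" [Line [Str "first item"]] ]
              , [ Paragraph "" [Line [Emph [Str "second"], Str " item"]] ]
              ]
          ]
      ]
      [ Doc "A subsection"
            [ Paragraph "" [Codeblock [[Str "main = return ()"]]] ]
            []
      ]

-- Illustrative helper: count a document and all of its nested subsections.
countSections :: Doc -> Int
countSections (Doc _ _ subs) = 1 + sum (map countSections subs)

main :: IO ()
main = print (countSections example)
```

Each list item is itself a list of 'Para's, so items can hold several paragraphs, while a 'Line' can only ever hold inline material.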

I want both the input format and the Haskell data structure to be independent of the output format being HTML. Therefore I'm not going to support inline HTML in the input. I also want structural markup (as opposed to presentational), so I left out horizontal rules and forced linebreaks. Lastly, I've never heard of using a "strong emphasis" (as opposed to normal emphasis) in typesetting, so I dropped that as well.

I've tried to design the above types in such a way as to minimize the possibility of forming nonsensical or ambiguous documents. That's why there is such deep nesting of different types instead of just one big algebraic data type with constructors for concatenation, paragraph and section breaks, etc. Comments welcome.
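
For contrast, here is a sketch of what the "one big type" alternative could look like and what goes wrong with it. The 'Flat' type and everything else in this snippet are hypothetical, made up purely to illustrate the point. With the nested design, something like an 'Emph' wrapped around a 'Blockquote' is rejected by the type checker, because 'Emph' only accepts a list of 'Inline's.

```haskell
module Main where

-- Hypothetical flat design, for comparison only (not the proposed one).
data Flat
  = FStr String
  | FEmph Flat
  | FConcat Flat Flat
  | FParaBreak
  | FSectionBreak String   -- section title

-- Type-checks, yet is nonsensical: a section break inside emphasis.
nonsense :: Flat
nonsense = FEmph (FConcat (FStr "a") (FSectionBreak "Oops"))

-- Ambiguity: two structurally different values for the same text.
left, right :: Flat
left  = FConcat (FStr "a") (FConcat (FStr "b") (FStr "c"))
right = FConcat (FConcat (FStr "a") (FStr "b")) (FStr "c")

-- Flatten to a plain string, dropping emphasis.
flatText :: Flat -> String
flatText (FStr s)          = s
flatText (FEmph f)         = flatText f
flatText (FConcat a b)     = flatText a ++ flatText b
flatText FParaBreak        = "\n\n"
flatText (FSectionBreak t) = "\n\n" ++ t ++ "\n\n"

main :: IO ()
main = print (flatText left == flatText right)
```

Both 'left' and 'right' denote the same text, so any code consuming 'Flat' would have to normalize first. The list-based 'Doc' types avoid that particular ambiguity by construction.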

I hope that the 'Doc' type will be useful in further coding. For example, it would be really cool to have a fancy combinator library for 'Doc's along with a pretty-printer to turn them back into plain text; then we could use them for general pretty output from Haskell programs. While there are several existing pretty-printing libraries, to my knowledge none of them use structural markup and they are all targeted at console output only.
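
As a rough idea of what the plain-text direction might look like, here is a minimal renderer sketch over the types above (repeated so the snippet stands alone). All function names and every layout choice are assumptions: asterisks for emphasis, backticks for code spans, four-space indentation for code blocks, "> " for quotes, and a blank line between paragraphs. It does no line wrapping and is nowhere near a real pretty-printer.

```haskell
module Main where

import Data.List (intercalate)

-- Types repeated from the design above (comments trimmed).
data Doc    = Doc String [Para] [Doc]
data Para   = Paragraph String [Block]
data Block  = Blockquote Doc
            | Bulleted [[Para]]
            | Numbered [[Para]]
            | Codeblock [[Inline]]
            | Line [Inline]
data Inline = Str String
            | Codespan [Inline]
            | Emph [Inline]
            | Link [Inline] String String
            | Image [Inline] String String

-- Inline markup rendered with Markdown-like conventions (an assumption).
inline :: Inline -> String
inline (Str s)         = s
inline (Codespan is)   = "`" ++ concatMap inline is ++ "`"
inline (Emph is)       = "*" ++ concatMap inline is ++ "*"
inline (Link is url _) = concatMap inline is ++ " <" ++ url ++ ">"
inline (Image alt _ _) = concatMap inline alt   -- fall back to the alternative

-- Text only, all markup dropped; used inside code blocks.
raw :: Inline -> String
raw (Str s)         = s
raw (Codespan is)   = concatMap raw is
raw (Emph is)       = concatMap raw is
raw (Link is _ _)   = concatMap raw is
raw (Image alt _ _) = concatMap raw alt

-- Each block becomes a list of output lines.
block :: Block -> [String]
block (Line is)      = [concatMap inline is]
block (Codeblock ls) = map (("    " ++) . concatMap raw) ls
block (Bulleted its) = concatMap (item "* ") its
block (Numbered its) =
  concat (zipWith (\n -> item (show n ++ ". ")) [1 :: Int ..] its)
block (Blockquote d) = map ("> " ++) (doc d)

-- Prefix the first line of an item with its marker, indent the rest.
item :: String -> [Para] -> [String]
item marker ps = case concatMap para ps of
  []     -> [marker]
  (l:ls) -> (marker ++ l) : map (replicate (length marker) ' ' ++) ls

para :: Para -> [String]
para (Paragraph title bs) = [title | not (null title)] ++ concatMap block bs

-- Title, paragraphs and subsections, separated by blank lines.
doc :: Doc -> [String]
doc (Doc title ps subs) =
  intercalate [""] ([[title] | not (null title)] ++ map para ps ++ map doc subs)

-- Hypothetical sample input.
sample :: Doc
sample =
  Doc "Title"
      [ Paragraph ""
          [ Line [Str "Hello, ", Emph [Str "world"], Str "."]
          , Bulleted [ [Paragraph "" [Line [Str "one"]]]
                     , [Paragraph "" [Line [Str "two"]]] ]
          ]
      ]
      []

main :: IO ()
main = putStr (unlines (doc sample))
```

Because the renderer just follows the structure of the types, a second backend (HTML, say) would be another set of functions of the same shape rather than a change to 'Doc' itself.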

1 Jun 2006 pesco » (Journeyer)