I wrote previously about my plan to use Markdown as the input format for advopost. I decided against re-using the existing Markdown-to-HTML converter, because I would have to strip the resulting output down to the Advogato subset of HTML in postprocessing, which feels too kludgy. So I'm going to implement a parser for (a variant of) Markdown that reads the input into a structured Haskell data type 'Doc'. Here is my current design for that type:
module Doc where

data Doc = Doc String   -- title
               [Para]   -- body
               [Doc]    -- subsections

data Para = Paragraph String   -- paragraph title
                      [Block]  -- paragraph body

data Block = Blockquote Doc
           | Bulleted [[Para]]     -- unordered list
           | Numbered [[Para]]     -- numbered list
           | Codeblock [[Inline]]  -- list of lines, ignore Codespans
           | Line [Inline]

data Inline = Str String       -- ignore linebreaks
            | Codespan [Inline]
            | Emph [Inline]
            | Link [Inline]    -- link text
                   String      -- link target
                   String      -- link title
            | Image [Inline]   -- fallback alternative for this image
                    String     -- image location
                    String     -- image title
I want both the input format and the Haskell data structure to be independent of the output format being HTML. Therefore I'm not going to support inline HTML in the input. I also want structural markup (as opposed to presentational), so I left out horizontal rules and forced linebreaks. Lastly, I've never heard of using "strong emphasis" (as opposed to normal emphasis) in typesetting, so I dropped that as well.
I've tried to design the above types in such a way as to minimize the possibility of forming nonsensical or ambiguous documents. That's why there is such deep nesting of different types instead of just one big algebraic data type with constructors for concatenation, paragraph and section breaks, etc. Comments welcome.
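To make the contrast concrete, here is a small sketch of the kind of flat type being avoided. The type 'Flat' and all of its constructors are hypothetical names invented for this illustration; they are not part of advopost.

```haskell
-- Hypothetical flat alternative (an assumption for illustration only):
-- a single type with constructors for text, concatenation and breaks.
data Flat
  = Text String
  | Concat Flat Flat
  | ParaBreak
  | SectionBreak String   -- starts a new section with the given title
  | Emphasis Flat
  deriving (Eq, Show)

-- Type-checks, but is meaningless as a document:
-- a section break in the middle of emphasised text.
nonsense :: Flat
nonsense = Emphasis (Concat (Text "foo") (SectionBreak "Oops"))

-- Ambiguous: the same text has two structurally different representations.
groupedLeft, groupedRight :: Flat
groupedLeft  = Concat (Concat (Text "a") (Text "b")) (Text "c")
groupedRight = Concat (Text "a") (Concat (Text "b") (Text "c"))
```

With the nested 'Doc' types, a section can only appear in the subsection list of a 'Doc', so a value like 'nonsense' cannot be written at all, and sequences are plain lists, so there is no grouping to disagree about.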
I hope that the 'Doc' type will be useful in further coding. For example, it would be really cool to have a fancy combinator library for 'Doc's along with a pretty-printer to turn them back into plaintext. Then we could use them for general pretty-printed output from Haskell programs. While there are several existing pretty-printing libraries, to my knowledge none of them use structural markup and they are all targeted at console output only.
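As a rough illustration of what such a combinator layer and plaintext pretty-printer might look like, here is a minimal sketch. The helpers 'para' and 'text', the render functions, and the plaintext conventions chosen (underlined titles, '*' for emphasis, '> ' for quotes) are all assumptions made for this example; only the four data types come from the design above, and they are repeated so the sketch compiles on its own.

```haskell
import Data.List (intercalate)

-- Types repeated from the Doc module so this sketch is self-contained.
data Doc    = Doc String [Para] [Doc]
data Para   = Paragraph String [Block]
data Block  = Blockquote Doc
            | Bulleted [[Para]]
            | Numbered [[Para]]
            | Codeblock [[Inline]]
            | Line [Inline]
data Inline = Str String
            | Codespan [Inline]
            | Emph [Inline]
            | Link [Inline] String String
            | Image [Inline] String String

-- Hypothetical combinators: an untitled one-line paragraph and plain text.
para :: [Inline] -> Para
para is = Paragraph "" [Line is]

text :: String -> Inline
text = Str

-- Render inline markup using Markdown-like plaintext conventions.
renderInline :: Inline -> String
renderInline (Str s)            = s
renderInline (Codespan is)      = "`" ++ concatMap renderInline is ++ "`"
renderInline (Emph is)          = "*" ++ concatMap renderInline is ++ "*"
renderInline (Link is target _) = concatMap renderInline is ++ " <" ++ target ++ ">"
renderInline (Image alt _ _)    = concatMap renderInline alt  -- fallback text only

-- Each block renders to a list of output lines.
renderBlock :: Block -> [String]
renderBlock (Line is)        = [concatMap renderInline is]
renderBlock (Codeblock ls)   = map (("    " ++) . concatMap renderInline) ls
renderBlock (Blockquote d)   = map ("> " ++) (renderDoc d)
renderBlock (Bulleted items) = concatMap (renderItem "* ") items
renderBlock (Numbered items) =
  concat (zipWith (\n item -> renderItem (show n ++ ". ") item) [(1 :: Int) ..] items)

-- Prefix the first line of a list item with its marker, indent the rest.
renderItem :: String -> [Para] -> [String]
renderItem marker ps = case concatMap renderPara ps of
  []       -> [marker]
  (l : ls) -> (marker ++ l) : map (indent ++) ls
  where indent = replicate (length marker) ' '

-- A paragraph is its optional title line followed by its blocks.
renderPara :: Para -> [String]
renderPara (Paragraph title blocks) =
  [title | not (null title)] ++ concatMap renderBlock blocks

-- Title (underlined), paragraphs and subsections, separated by blank lines.
renderDoc :: Doc -> [String]
renderDoc (Doc title body subs) =
  intercalate [""] (heading ++ map renderPara body ++ map renderDoc subs)
  where heading = [[title, map (const '=') title] | not (null title)]

example :: Doc
example = Doc "Hello"
  [ para [text "Some ", Emph [text "emphasised"], text " text."]
  , Paragraph "" [Bulleted [[para [text "one"]], [para [text "two"]]]]
  ]
  []
```

Loaded into GHCi, 'putStr (unlines (renderDoc example))' prints the title underlined with '=' signs, then the paragraph, then the two bullet items. A renderer for the Advogato HTML subset could walk the same structure, which is the point of keeping 'Doc' independent of any one output format.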