Structured text in Haskell redux, or "Posting to Advogato ASCII style"
A few weeks ago I posted my initial thoughts about how to represent
a structured text document in Haskell. As often with initial
thoughts, they were flawed. ;-) In particular, I wrote:
I've tried to design the above types in such a way as to minimize the
possibility of forming non-sensical or ambiguous documents. That's
why there is such deep nesting of different types instead of just one
big algebraic data type with constructors for concatenation,
paragraph and section breaks, etc.
Without going into too much detail, I stumbled over the problem in
the representation of block quotes: The corresponding constructor
expected a full Doc and I couldn't break sufficiently small fragments
out of those for quoting. For example, had I wanted to quote one item
out of a numbered list, it would have gotten the number "1" again in
the quote, because it would have been the first item within that
document (fragment)...
So, I thought, should I make one big data type for the whole document
after all? But still, I didn't want to include too much possibilities for
such senseless combinations as bulleted lists and blockquotes
within section headings. After some pondering I've decided on using
a two-level structure, with one data type representing documents
(or document fragments) at the block level (paragraphs, sections, etc.)
and a second one for in-line structures only. In particular, a document
is taken to be a list of "block-level tokens" ('Btok's):
type Doc = [Btok]
Analogously, (logical) lines are represented by line-level tokens:
type Line = [Ltok]
Apart from plain strings, these consist of things like emphasis,
code spans, etc.:
data Ltok = St String -- character string
| Em Line -- emphasis
| Co String -- inline code
...
Note that under their intended interpretation, these constructors form
homomorphisms into the monoid of
Ltok-lists (wrapping their
result in a single-element list): e.g.
[St a] ++ [St b] is expected
to be equal to
[St (a++b)],
[Em a] ++ [Em b] == [Em (a++b)], etc.
This enables one to split a line in two at an arbitrary point without
"losing any information". :-)
I noticed that if something like the above could hold for the
block-level Doc type as well, my block quote problem would be
solved pretty elegantly. And indeed, I found the following
representation:
data Btok = TEXT [Line] -- text block (logical lines)
| PARA Line -- new paragraph (w. opt. title)
| SECT Int Line -- new section (of given level)
| QUOT Doc -- blockquote
| BULL [Doc] -- bulleted list
...
Notice that the
PARA and
SECT constructors do not wrap the
corresponding paragraph or section. Their meaning is analogous to
that of a new-line character in plain-text: they start a new structure
which spans everything up to the next break. This is exactly what makes
QUOT work: We can break the document apart in the middle of a
section or paragraph and rightfully consider the resulting parts
fragments in the common sense that they can be glued back together to
form the original. Or to put it differently, a quoted fragment
carries all of its "own" structural information (where the meaning of
"own" also results from the data type's structure).
"Wait, there's more fun to be had!"
Remember that the whole point of my foray into structured documents
originated in my little project of posting to my Advogato diary via
email. So, say I had a document to be posted in memory as a value
of type Doc as above. I'd want to convert it to HTML (or Advogato's
subset thereof). While the representation of section breaks as
stand-alone SECTs within a stream of block tokens closely fits with
the <h1-6> tags to appear in the HTML, paragraphs are supposed to
be wrapped in <p> tags. One obvious (but annoying) solution to
this discrepancy would be to build a second data type, probably a
little more like my original idea again, and convert to HTML via the
detour of this intermediate type.
Luckily I remembered that Oleg[1] wrote something about folds being
superior to cursors[2] as interfaces to data collections, so I
implemented a function of the following humiliating type signature.
;)
-- Fold over a Doc's block-level structure.
folddoctree
:: (Biblio -> Line -> line) -- logical line
-> ([line] -> tok) -- text block
-> (String -> tok) -- code block
-> (doc -> tok) -- blockquote
-> ([doc] -> tok) -- bulleted list
-> (Int -> [doc] -> tok) -- numbered list
-> ([(String,line)] -> tok) -- bibliography
-> ( line -- para. title
-> [tok] -- para. content
-> para ) -- paragraph
-> ( Int -- sect. level
-> line -- sect. title
-> doc -- sect. content
-> sect ) -- section
-> ( [para] -- initial paragraphs
-> [sect] -- following sections
-> doc ) -- document
-> Doc -> doc
As you can see, it takes ten functions as parameters, each telling it
what to do with a certain part of the document, which it then
transforms accordingly. Given this function, an HTML export for
Docs is straight-forward to define.
NB. Notice how the different type variables are threaded through
the type signature. It feels almost like LISP, in the sense that the
functions can return any type they want, as long as everything fits.
But unlike in a dynamically-typed language, the typechecker will tell
you up-front if you make a mistake. :-P
"Show me the code!"
Ah right, most important! The full definitions of the Doc type, the
folddoctree function, and some further utilities can be seen at the
following address:
http://www.khjk.org/~sm/code/advopost/Doc.lhs
Enjoy! :-)
References:
[1]
[2]