Older blog entries for pesco (starting at number 17)

Structural plain-text, next iteration

So, the first version of my "structural plain-text" parser has just started working and I'm already dissatisfied with it. Well, the reasons are simple:

  1. It doesn't work correctly in all cases.

  2. The code is a mess, even though it was already the third or so iteration internally ("plan to throw...", you know).

So here we go for the next iteration. I hope I'll be able to make the parser code substantially cleaner and more modular, and to implement a few cool-to-have additional features along the way. But first I decided to carefully write out a description of the intended grammar. I just finished the first draft, which is still missing details on the planned new features. It's probably also still full of inaccuracies, because I've changed some details from last time (in hopes of simplifying both the parser and the syntax explanation) and I haven't implemented the actual parser yet. Still, if you're interested, have a look[1].

PS. Oh yeah, I've included "strong emphasis" in this version after all, because there are uses for it, specifically in technical documentation (which I intend to write). It carries a Fat Warning (TM) in the docs, so I'm happy and people can do whatever they want. ;-)

Structured text in Haskell redux, or "Posting to Advogato ASCII style"

A few weeks ago I posted my initial thoughts about how to represent a structured text document in Haskell. As often with initial thoughts, they were flawed. ;-) In particular, I wrote:

I've tried to design the above types in such a way as to minimize the possibility of forming non-sensical or ambiguous documents. That's why there is such deep nesting of different types instead of just one big algebraic data type with constructors for concatenation, paragraph and section breaks, etc.

Without going into too much detail, I stumbled over a problem in the representation of block quotes: the corresponding constructor expected a full Doc, and I couldn't break sufficiently small fragments out of those for quoting. For example, had I wanted to quote one item out of a numbered list, it would have gotten the number "1" again in the quote, because it would have been the first item within that document (fragment)...

So, I thought, should I make one big data type for the whole document after all? But still, I didn't want to include too many possibilities for such senseless combinations as bulleted lists and blockquotes within section headings. After some pondering I've decided on a two-level structure, with one data type representing documents (or document fragments) at the block level (paragraphs, sections, etc.) and a second one for in-line structures only. In particular, a document is taken to be a list of "block-level tokens" ('Btok's):

type Doc = [Btok]

Analogously, (logical) lines are represented by line-level tokens:

type Line = [Ltok]

Apart from plain strings, these consist of things like emphasis, code spans, etc.:

data Ltok =  St  String           -- character string
          |  Em  Line             -- emphasis
          |  Co  String           -- inline code
          ...
Note that under their intended interpretation, these constructors form homomorphisms into the monoid of Ltok-lists (wrapping their result in a single-element list): e.g. [St a] ++ [St b] is expected to be equal to [St (a++b)], [Em a] ++ [Em b] == [Em (a++b)], etc. This enables one to split a line in two at an arbitrary point without "losing any information". :-)
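
To make this concrete, here is a little sketch (the name 'normLine' is mine, and only the constructors shown above are handled): adjacent tokens built by the same constructor can always be merged, so a line can be split anywhere, glued back together, and normalized to its original form:

normLine :: Line -> Line
normLine (St a : St b : ts)  =  normLine (St (a ++ b) : ts)
normLine (Em a : Em b : ts)  =  normLine (Em (a ++ b) : ts)
normLine (Co a : Co b : ts)  =  normLine (Co (a ++ b) : ts)
normLine (t : ts)            =  t : normLine ts
normLine []                  =  []

-- e.g.  normLine ([St "foo"] ++ [St "bar"])  ==  [St "foobar"]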

I noticed that if something like the above could hold for the block-level Doc type as well, my block quote problem would be solved pretty elegantly. And indeed, I found the following representation:

data Btok =  TEXT  [Line]    -- text block (logical lines)
 
          |  PARA  Line      -- new paragraph (w. opt. title)
          |  SECT  Int Line  -- new section (of given level)
 
          |  QUOT  Doc       -- blockquote
          |  BULL  [Doc]     -- bulleted list
          ...
Notice that the PARA and SECT constructors do not wrap the corresponding paragraph or section. Their meaning is analogous to that of a new-line character in plain-text: they start a new structure which spans everything up to the next break. This is exactly what makes QUOT work: We can break the document apart in the middle of a section or paragraph and rightfully consider the resulting parts fragments in the common sense that they can be glued back together to form the original. Or to put it differently, a quoted fragment carries all of its "own" structural information (where the meaning of "own" also results from the data type's structure).
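
For instance, here is a little sketch (the concrete values are made up):

-- A document fragment: a section header, a text block, an untitled
-- paragraph break, and another text block.
example :: Doc
example = [ SECT 1 [St "Intro"]
          , TEXT [[St "First paragraph."]]
          , PARA []
          , TEXT [[St "Second paragraph."]]
          ]

-- splitAt yields fragments that concatenate back to the original;
-- quoting, say, the second paragraph keeps its own paragraph break
-- but not the enclosing section header:
quoted :: Btok
quoted = QUOT (drop 2 example)
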
"Wait, there's more fun to be had!"

Remember that my whole foray into structured documents originated in my little project of posting to my Advogato diary via email. So, say I had a document to be posted in memory as a value of type Doc as above. I'd want to convert it to HTML (or Advogato's subset thereof). While the representation of section breaks as stand-alone SECTs within a stream of block tokens closely fits the <h1>-<h6> tags to appear in the HTML, paragraphs are supposed to be wrapped in <p> tags. One obvious (but annoying) solution to this discrepancy would be to build a second data type, probably a little more like my original idea again, and convert to HTML via the detour of this intermediate type.

Luckily I remembered that Oleg[1] wrote something about folds being superior to cursors[2] as interfaces to data collections, so I implemented a function with the following humiliating type signature. ;)

-- Fold over a Doc's block-level structure.
folddoctree
  ::  (Biblio -> Line -> line)     -- logical line
 
      -> ([line] -> tok)           -- text block
      -> (String -> tok)           -- code block
      -> (doc -> tok)              -- blockquote
      -> ([doc] -> tok)            -- bulleted list
      -> (Int -> [doc] -> tok)     -- numbered list
      -> ([(String,line)] -> tok)  -- bibliography
 
      -> (  line                   -- para. title
            -> [tok]               -- para. content
            -> para  )             -- paragraph
 
      -> (  Int                    -- sect. level
            -> line                -- sect. title
            -> doc                 -- sect. content
            -> sect  )             -- section
 
      -> (  [para]                 -- initial paragraphs
            -> [sect]              -- following sections
            -> doc  )              -- document
 
      -> Doc -> doc

As you can see, it takes ten functions as parameters, each telling it what to do with a certain part of the document, which it then transforms accordingly. Given this function, an HTML export for Docs is straightforward to define.
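
For illustration, here is a rough sketch of such an export, instantiating all of the fold's type variables with String. The helpers 'renderLine' (line-level rendering against the bibliography) and 'escape' (HTML escaping) are hypothetical:

htmlDoc :: Doc -> String
htmlDoc = folddoctree
    renderLine                                          -- logical line
    unlines                                             -- text block
    (\s     -> "<pre>" ++ escape s ++ "</pre>")         -- code block
    (\d     -> "<blockquote>" ++ d ++ "</blockquote>")  -- blockquote
    (\ds    -> "<ul>" ++ concatMap li ds ++ "</ul>")    -- bulleted list
    (\_ ds  -> "<ol>" ++ concatMap li ds ++ "</ol>")    -- numbered list
    (concatMap ref)                                     -- bibliography
    (\t ts  -> "<p><b>" ++ t ++ "</b>" ++ concat ts ++ "</p>")   -- paragraph
    (\n t d -> "<h" ++ show n ++ ">" ++ t ++ "</h" ++ show n ++ ">" ++ d)  -- section
    (\ps ss -> concat ps ++ concat ss)                  -- document
  where
    li d       = "<li>" ++ d ++ "</li>"
    ref (k, l) = "[" ++ k ++ "] " ++ l ++ "<br>"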

NB. Notice how the different type variables are threaded through the type signature. It feels almost like LISP, in the sense that the functions can return any type they want, as long as everything fits. But unlike in a dynamically-typed language, the typechecker will tell you up-front if you make a mistake. :-P

"Show me the code!"

Ah right, most important! The full definitions of the Doc type, the folddoctree function, and some further utilities can be seen at the following address:

http://www.khjk.org/~sm/code/advopost/Doc.lhs

Enjoy! :-)

References:

[1]:
Oleg Kiselyov, Haskell (and Scheme (and probably more)) guru. See http://okmij.org/.
[2]:
Oleg Kiselyov: "Towards the best collection traversal interface".
http://okmij.org/ftp/Computation/Continuations.html#enumerator-stream

11 Jun 2006 (updated 11 Jun 2006 at 22:30 UTC)
Fun with email -- or how I spent my day

I'd like to share the results of my latest "freshmeat" spree. For context, I keep all my email on a university server so I can access it from anywhere (no, there is no webmail!). I have access through IMAP, which I used with Thunderbird until yesterday -- when I finally got fed up with it. Thunderbird regularly mixes up its idea of which mail folders there are on the IMAP server, and where. So there I was, looking for a new MUA. Lots of dissatisfaction ensued. Until I found two wonderful programs:

  1. OfflineIMAP
  2. Mutt-NG

The first does something very simple (but in a smart way): it synchronizes IMAP mailboxes with a local repository. Both ways. So I set it up with all my IMAP settings (through a nice, and "unuserfriendly", config file -- sic!) and let it sync every five minutes. New mails get pulled into local Maildir folders, where any old MUA can conveniently handle them. And whatever changes I make to them are propagated back to the server. And no, it doesn't lose any mails; or so the author assures. The author, btw., turns out to be a fellow Haskell programmer -- John Goerzen, thank you: you just made my day!
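
For the curious, a minimal .offlineimaprc along these lines might look as follows (account, host, and user names are of course made up; 'autorefresh' is in minutes):

[general]
accounts = Uni

[Account Uni]
localrepository = UniLocal
remoterepository = UniRemote
autorefresh = 5

[Repository UniLocal]
type = Maildir
localfolders = ~/Mail

[Repository UniRemote]
type = IMAP
remotehost = imap.example.edu
remoteuser = pesco
ssl = yes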

Thanks to OfflineIMAP I can finally use my old-time favourite MUA again: Mutt. But wait, as I discovered, there is even Mutt-NG now, which finally adds the one thing I had been missing all along: a sidebar that shows an overview of (new/old/etc.) messages in my incoming mail folders! Oh, and for extra pleasure, it also speaks SMTP, so I don't even have to set up a local mail transfer agent.

And here's some more general Mutt praise: with just a few simple config file (sic!) settings, mutt pretty much takes care of moving read mails into yearly archive folders all by itself:

save-hook ~A =sa-spam

This means that pressing 's' to move a message to another folder presents sa-spam as the default, which happens to be the folder my SpamAssassin learns from.

set record="=Archive/`date +%Y`/Sent"

This is where mutt saves Fcc's, i.e. copies of sent messages. Notice the automatic year selection.

set mbox="=Archive/`date +%Y`/Inbox"
mbox-hook =Somelist   =Archive/`date +%Y`/Somelist
mbox-hook =Otherlist  =Archive/`date +%Y`/Otherlist

Finally, these settings are the real fun: remember how default Mutt setups usually ask on quit whether they should save read messages to "mbox" in your homedir? A completely stupid question; nobody wants that. But with the above, it suddenly changes to something sensible: the correct archive folder for the current year, with different input folders sorted into corresponding archive subfolders. Wonderful.

Maybe you cannot really appreciate how good this is. But that's usually the case with improved usability: you don't notice until afterwards how much better things feel with just that little tweak. Compare my previous post about bitlbee. Life just started being fun again! Time to throw out all those black clothes and start listening to eurodance.

Enjoy.

PS. Friendly apologies for praising config files go to lkcl who recently bitched about "this crap obscure text file editing shit". ;-P

10 Jun 2006 (updated 10 Jun 2006 at 16:00 UTC)
"About the Swedish analysis of Nazi Germany's crypto teleprinters"

I've just returned from a two-semester exchange at Charles University[1] in Prague. There I had the chance to take a course on classical cryptography, for which I wrote an essay that might be interesting to others:

Everybody knows about the Enigma machine[2]. It was a cryptographic device developed by the Germans, used extensively by Hitler's troops, and famously broken by the Allies at Bletchley Park. It was an "offline" device in that the plaintext was keyed in letter by letter, with the machine indicating the corresponding ciphertext character at each keystroke. Now, what many are not aware of is that the Germans also used a series of online crypto devices during WWII. One of these was the Siemens T52[3], a.k.a. the "Geheimschreiber". This machine was an actual teleprinter with built-in automatic encryption and decryption, i.e. it encrypted and transmitted the characters as they were typed (or fed via punched tape), and vice-versa. Its cryptographic algorithm is considered significantly more sophisticated than that of the Enigma.

When Germany leased several telegraph lines from Sweden, the Swedes immediately tapped them, but the T52 was soon employed on those lines. My essay[4] describes how the Swedish cryptanalyst Arne Beurling broke the encryption without ever having seen one of the machines. It's mainly a condensed form of the description from "Codebreakers"[5], with less history and more mathematics.

Enjoy.

References:

[1]:
"Univerzita Karlova v Praze"
http://www.cuni.cz/
[2]:
Wikipedia: "Enigma machine"
http://en.wikipedia.org/wiki/Enigma_machine
[3]:
Wikipedia: "Siemens and Halske T52"
http://en.wikipedia.org/wiki/Siemens_and_Halske_T52
[4]:
Sven Moritz Hallberg: "About the Swedish analysis of Nazi Germany's crypto teleprinters" (2006)
http://www.khjk.org/~sm/distfiles/sm-2006-gschreiber.pdf
[5]:
Bengt Beckman: "Codebreakers: Arne Beurling and the Swedish Crypto Program during World War II", Oxford University Press (2003)

Update: Typesetting a reference list in Advogato

For those of you who, like me, are not completely fluent in CSS (and maybe a little stuck with HTML wisdom of the '90s), here's how you produce hanging indents as in the reference list above:

  1. Somehow determine the width of your left column. For the reference list, I took the maximum number of characters in it plus one, times the CSS unit 'ex' (nominally the height of the letter 'x', but a workable stand-in for a character width here): "[n]:" is four characters, so I use 5ex.
  2. Put the left column text into a <div> and give it the attribute 'style="float: left"', so that the following text will flow down to the right of it.
  3. Put the right column text into another <div> and give it the attribute 'style="margin-left: 5ex"' (replacing 5ex with whatever left-column width you want). This results in the hanging indent.

For reference, here's the code for the first reference entry above:
<div style="float: left">[1]:</div>
<div style="margin-left: 5ex">
  "Univerzita Karlova v Praze"<br>
  <a href="http://www.cuni.cz">http://www.cuni.cz/</a>
</div>

For a reference list, it also looks good to indent the whole thing a little by putting it in another <div style="padding: 1ex">.

Enjoy.

1 Jun 2006 (updated 1 Jun 2006 at 10:23 UTC)
Representing marked-up text in Haskell

I wrote previously about my plan to use Markdown as the input format for advopost. I decided against re-using the existing Markdown-to-HTML converter, because I would have to strip the resulting output down to the Advogato subset of HTML in postprocessing; that feels too kludgy. So I'm going to implement a parser for (a variant of) Markdown that reads the input into a structured Haskell data type 'Doc'. Here is my current design for that type:

   module Doc where
   data Doc     =  Doc         String      -- title
                               [Para]      -- body
                               [Doc]       -- subsections 
   data Para    =  Paragraph   String      -- paragraph title
                               [Block]     -- paragraph body
   data Block   =  Blockquote  Doc
                |  Bulleted    [[Para]]    -- unordered list
                |  Numbered    [[Para]]    -- numbered list
                |  Codeblock   [[Inline]]  -- list of lines, ignore Codespans
                |  Line        [Inline]
   data Inline  =  Str         String      -- ignore linebreaks
                |  Codespan    [Inline]
                |  Emph        [Inline]
                |  Link        [Inline]    -- link text
                               String      -- link target
                               String      -- link title
                |  Image       [Inline]    -- fallback alternative for this image
                               String      -- image location
                               String      -- image title

I want both the input format and the Haskell data structure to be independent of the output format being HTML. Therefore I'm not going to support inline HTML in the input. I also want structural markup (as opposed to presentational), so I left out horizontal rules and forced linebreaks. Lastly, I've never heard of "strong emphasis" (as opposed to normal emphasis) being used in typesetting, so I dropped that as well.

I've tried to design the above types in such a way as to minimize the possibility of forming non-sensical or ambiguous documents. That's why there is such deep nesting of different types instead of just one big algebraic data type with constructors for concatenation, paragraph and section breaks, etc. Comments welcome.

I hope that the 'Doc' type will be useful in further coding. For example, it would be really cool to have a fancy combinator library for 'Doc's along with a pretty-printer to turn them back into plaintext: Then we could use them for general pretty output from Haskell programs. While there are several existing pretty-printing libraries, to my knowledge none of them use structural markup and they are all targeted at console output only.

31 May 2006 (updated 31 May 2006 at 20:02 UTC)

Posting to Advogato via email

If this message reaches my Advo diary, my new mail-to-advo gateway program is not only working but also correctly embedded into my mail setup via procmail. Thus, albeit undocumented, underfeatured, and unpolished, the program is quite functional. Lest I forget it myself, here are the quick set-up instructions:

Step 1: Dependencies.

The program is written in Haskell and has been tested with GHC 6.4.1. It requires a number of Haskell packages, namely: MissingH (0.13.1), HaXml (1.13), NewBinary (2005-05-30), Crypto (3.0.3), http (2005-04-23), and haxr (2006-05-30, custom-patched).

I've already submitted my patches for haxr (the XML-RPC library) to its author, Björn Bringert. Hopefully, he'll accept them soon. In short, the particular problem was that Advogato returns <value> elements with whitespace (i.e. character data) around the actual data element. I had to modify a function called fromXRValue to filter it out -- /unless/ the whole <value> consists of /only/ character data, which is supposed to be interpreted equivalently to a <string> data element. Duh. (Rant: This format is supposed to be processed by /machines/ for fsck's sake!)
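
The real patch operates on haxr's own XML types, but the idea is simple enough to sketch stand-alone with toy types:

import Data.Char (isSpace)

data Node = Text String | Element String [Node]

-- Children of a <value>: if there is anything besides character
-- data, whitespace-only text nodes are just noise; if it is all
-- character data, it counts as an implicit <string>.
valueContent :: [Node] -> [Node]
valueContent kids
    | all isText kids  =  kids
    | otherwise        =  filter (not . isBlank) kids
  where
    isText  (Text _)   =  True
    isText  _          =  False
    isBlank (Text s)   =  all isSpace s
    isBlank _          =  False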

Step 2: Fetch and build.

There's just a single source file, http://www.khjk.org/~sm/code/advopost/advopost.lhs. Fetch it and build with the straightforward command 'ghc -o advopost --make advopost.lhs'. Make sure you've got all the above dependencies visible (I had to manually 'ghc-pkg expose' HaXml, I think).

Step 3: Procmailrc.

Choose an email account to direct your postings to. I'm posting to a mailing list and the posts loop back to my own account. There, the following procmail recipe passes any posts from me to that list through advopost:

:0 c
* ^TO_the@mailing.list
* ^From:.*my@email.addr
* ! ^In-Reply-To:
| /path/to/advopost my-advo-name my-advo-pass

Put this recipe at the beginning of your respective .procmailrc; the 'c' flag in the first line means "copy", i.e. a copy of the mail is passed to advopost and the original continues through the rest of .procmailrc. The third condition line makes this rule apply only to messages that don't carry an "In-Reply-To" header, i.e. only fresh posts are passed to Advogato, not my replies to other people's messages.

Enjoy.

PS. Er, naturally, advopost is open-source software. You may peruse the published version under the terms of the common 3-clause BSD license.

PPS. I know, this post is missing lots of markup. While advopost would pass it all through untouched, I don't want to burden the mailing list I'm cross-posting to with lots of HTML tags. My plan is to finally interpret the input to advopost as Markdown, cf. http://daringfireball.net/projects/markdown/.

Update. Aha, of course, non-ASCII characters come through wrong. I'm pretty sure I'm transmitting them as UTF-8, or at least so I had assumed. Will have to investigate...

Update 2. Yay, Björn has already accepted my patches! Advopost will now work with the latest haxr from darcs (2006-05-31).

30 May 2006 (updated 30 May 2006 at 09:45 UTC)

Posting to Advogato via email

Sooo, I want to post to my diary by email. If you read this, the program is working. :)

Update. Oh, of course you want to see the code! There it is, in all its unpolished glory. Note that about 90% of it is concerned with working around bugs in the crappy email parser. ;-/

I've been toying around with the idea of creating a structural markup language (like XML) with LaTeX-like syntax and a simple semantics. My idea is to let what are macros in (La)TeX become functions. Then the available functions and their types form the analogue of, say, XML Schemas. Giving a concrete implementation of the functions provides an interpretation of the document, for example transforming it into some other format (as XSLT does). I wonder if people think this is a good idea, or would even like to help with its development. I have produced an initial scratch implementation of a parser in Haskell. There is also a function to transform the parsed document into a Haskell expression.
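
To illustrate the idea (all names here are made up), the analogue of a schema could be a record of functions, and a parsed document would become an expression polymorphic over it:

data Interp a = Interp
  { str  :: String -> a   -- plain character data
  , emph :: a -> a        -- \emph{...}
  , sect :: a -> a -> a   -- \section{title} body
  , cat  :: [a] -> a      -- juxtaposition
  }

-- What the parser might produce for
--   \section{Intro} Hello \emph{world}.
doc :: Interp a -> a
doc i = cat i [ sect i (str i "Intro")
                       (cat i [ str i "Hello "
                              , emph i (str i "world")
                              , str i "." ]) ]

-- One interpretation: rendering to (Advogato-style) HTML.
html :: Interp String
html = Interp
  { str  = id
  , emph = \x -> "<i>" ++ x ++ "</i>"
  , sect = \t b -> "<h1>" ++ t ++ "</h1>" ++ b
  , cat  = concat
  }

-- e.g.  doc html == "<h1>Intro</h1>Hello <i>world</i>."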

As a next step, I should add a little main routine for reading in a document and spitting out a simple Haskell module which

  • imports some given module as an interpretation and
  • exports the source document under that interpretation (through the conversion to an expression, as described).
The obvious proof-of-concept toy, then, is of course an Advogato exporter.

Another amazing discovery!

Have you ever wondered what it's like in hyperspace? I remember that Space Gothic explained the experience of "looking out the window in hyperspace" as instantly turning everyone insane, which I quite liked. Then, according to Event Horizon, as many will remember, there's just Hell in hyperspace, also a somewhat enjoyable theory. However, as anyone who was paying attention in the early nineties will instantly confirm, in the modern world of today we know that there is disco music in hyperspace!

The scientific device by which the above revelation was established is called Star Control II and nowadays it is freely available for every skeptic to confirm this earth-shaking discovery.

15 Dec 2005 (updated 15 Dec 2005 at 14:14 UTC)

Wow, I just found bitlbee! Now I can finally get rid of Gaim. For those who don't know, bitlbee is a Jabber/ICQ/MSN/etc. to IRC gateway. It runs as an IRC server on localhost, maintaining a channel which all your buddies from the different IM networks join when they come online. So nice! Then you can either /msg them or just talk to them in the customary IRC way ("nick: blah") from the control channel.

Oh, by the way, setup is really simple: just apt-get install bitlbee, /connect localhost, and follow the instructions of the bee.
