8 Jun 2007 slamb   » (Journeyer)

Training server-side Bayesian filters

Last night I worked on an unobtrusive way to train SpamAssassin's Bayesian database. (Autotraining sure spam and ham as it's delivered is nice, but you at least need a way of correcting its mistakes or it will keep making them.) The sa-learn utility is quite easy to use, but how do you specify what messages to feed to it? I haven't seen any good glue for this. You want to feed it messages which have been examined and categorized, and ideally you want to feed it each message exactly once. (sa-learn does realize that it's seen a message before, but it still takes some processing time to do even that.)

I decided to harness the power of RFC 2060 (the IMAP4rev1 specification, since superseded by RFC 3501). My trainer connects via IMAP4rev1, executes a SEARCH command for candidates (letting the server do the work of an arbitrarily complex query), downloads the messages and pipes them through sa-learn, flags them as learned (so the next search will skip them), and disconnects. I implemented it using imapfilter, and so far it works quite well. This approach would work even if the SpamAssassin machine were separate from the mail store machine.
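For illustration, here is roughly what that search / fetch / learn / flag loop looks like in Python. This is a sketch, not the imapfilter code described above. The `train` function, the `SALearned` keyword name, and the criteria strings are all made up for the example. It assumes `conn` is an `imaplib.IMAP4` (or `IMAP4_SSL`) object that has already logged in and selected a mailbox read-write.

```python
import subprocess

# Hypothetical keyword used to mark messages already fed to sa-learn.
LEARNED = "SALearned"


def train(conn, criteria, learn_cmd, run=subprocess.run):
    """Feed every message matching `criteria` to `learn_cmd`, then flag it.

    `conn` is an authenticated imaplib.IMAP4 object (or anything with the
    same `uid` method) with a mailbox selected read-write.
    Returns the list of UIDs that were learned.
    """
    # Let the server evaluate the query; only matching UIDs come back.
    typ, data = conn.uid("SEARCH", criteria)
    if typ != "OK":
        raise RuntimeError("SEARCH failed: %r" % (data,))
    uids = (data[0] or b"").decode().split()

    learned = []
    for uid in uids:
        # BODY.PEEK[] downloads the full message without setting \Seen.
        typ, parts = conn.uid("FETCH", uid, "(BODY.PEEK[])")
        if typ != "OK" or not parts or not isinstance(parts[0], tuple):
            continue  # message vanished or fetch failed; retry next run
        raw = parts[0][1]
        run(learn_cmd, input=raw, check=True)
        # Flag only after the learner succeeded, so failures are retried.
        conn.uid("STORE", uid, "+FLAGS", "(%s)" % LEARNED)
        learned.append(uid)
    return learned
```

A call might look like `train(conn, "KEYWORD Junk UNKEYWORD SALearned", ["sa-learn", "--spam"])`. The `run` parameter exists only so the loop can be exercised without a real sa-learn binary.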

In the process, I noticed that Thunderbird updates spam status on the IMAP server in the Junk and NonJunk keywords. Mail.app does the same, in the Junk and NotJunk keywords (plus a few others). Did you see it? One uses NonJunk, the other NotJunk. How hard would it have been to get these guys in a room to fight this one out? Grr. They have a weird interaction because they just didn't put any thought into it.
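One way to paper over the NonJunk / NotJunk split is to have the SEARCH accept either spelling. Here is a small Python sketch that builds such a query. The helper names and the `SALearned` flag are hypothetical, and the keyword lists are just the ones mentioned above (clients may set others).

```python
# Keyword names as reported above; clients may set others as well.
SPAM_KEYWORDS = ["Junk"]
HAM_KEYWORDS = ["NonJunk", "NotJunk"]  # Thunderbird, Mail.app


def any_keyword(keywords):
    """Build an IMAP search key matching messages with any of `keywords`."""
    if not keywords:
        raise ValueError("need at least one keyword")
    keys = ["KEYWORD %s" % k for k in keywords]
    expr = keys[-1]
    # IMAP's OR takes exactly two search keys, so nest right to left.
    for key in reversed(keys[:-1]):
        expr = "OR %s %s" % (key, expr)
    return expr


def candidates(keywords, learned_flag="SALearned"):
    """Search criteria for classified messages that are not yet learned."""
    return "%s UNKEYWORD %s" % (any_keyword(keywords), learned_flag)
```

With these lists, `candidates(HAM_KEYWORDS)` yields `OR KEYWORD NonJunk KEYWORD NotJunk UNKEYWORD SALearned`, so ham marked by either client is picked up by the same search.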

I also tried out Lua for the first time, as it's imapfilter's extension language. Turns out I hate it. I really wanted to like it. I had been thinking of using it all over an embedded product for rapid development with few resources. It's minimalist, fast, and so on. But it's just unpleasant to use. Maybe it's too minimalist. I would have liked a separate array type (rather than just "tables" / associative arrays), and I hate "high-level" languages without exceptions. imapfilter's library is also a bit limiting - its fetch_message and pipe_to do everything in memory. That makes me more irritated that Lua doesn't just have an array slice syntax I can use to pass message lists to fetch_message. And it means I have to spawn sa-learn a bunch of times for reasonable memory consumption, and starting a Perl process heavy with modules takes a long time.
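To show the memory-versus-process-startup trade-off in a language that does have slices, here is a Python sketch that fetches messages in fixed-size batches and runs sa-learn once per batch instead of once per message. The function names and the batch size are invented for the example. It assumes `conn` is a logged-in `imaplib.IMAP4` with a mailbox selected, and that sa-learn accepts a directory of message files as its argument.

```python
import os
import subprocess
import tempfile


def batches(items, size):
    """Yield successive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def learn_batch(conn, uids, mode, run=subprocess.run):
    """Fetch `uids` in one round trip and learn them with one sa-learn run.

    `mode` is "--spam" or "--ham". Returns the number of messages written.
    """
    typ, data = conn.uid("FETCH", ",".join(uids), "(BODY.PEEK[])")
    if typ != "OK":
        raise RuntimeError("FETCH failed: %r" % (data,))
    # imaplib returns one (envelope, literal) tuple per message, interleaved
    # with bare b")" items that carry no message data.
    bodies = [item[1] for item in data if isinstance(item, tuple)]
    with tempfile.TemporaryDirectory() as tmp:
        for n, raw in enumerate(bodies):
            with open(os.path.join(tmp, "%06d.eml" % n), "wb") as f:
                f.write(raw)
        if bodies:
            # Assumption: sa-learn treats a directory argument as a set of
            # one-message-per-file inputs.
            run(["sa-learn", mode, tmp], check=True)
    return len(bodies)
```

Looping with `for chunk in batches(uids, 50): learn_batch(conn, chunk, "--spam")` keeps at most 50 messages in memory at a time while paying the Perl startup cost once per 50 messages.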

I might end up rewriting my trainer in Python using either imaplib and subprocess or twisted.mail.imap4 and twisted.internet.process. I'm not real impressed with either mail API, though. I like the JavaMail API better, but forking and interacting with child processes from Java (or even Jython) sounds painful.
