17 Aug 2002 Bram   » (Master)

Translating into English

Paul Graham has some interesting ideas about how to filter spam. Unfortunately he gives code samples in some weird language, so I've translated them into Python and fleshed them out a bit.

This code counts each token at most once per message, since handling duplicates resulted in some rather ugly hacks. It also implements Raph's suggestion of using scores. Finally, rather than summing the fifteen scores with the highest absolute values, it sums the eight largest positive scores and the eight most negative ones, which is much more robust.
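To make the scoring concrete, here is a minimal sketch of the two pieces described above, pulled out as standalone functions. The names token_score, sum_extremes, and sum_top_absolute are illustrative and don't appear in the class below; the constants (doubling the non-spam count, the minimum of 5, the clamp at 4.6) are copied from it.

```python
from math import log

def token_score(good_count, bad_count):
    """Clamped log-ratio score for one token; positive means spammy.

    Returns None when there is too little data to score the token.
    """
    g = 2 * good_count  # non-spam counts are doubled, as in the class below
    b = bad_count
    if g + b < 5:
        return None
    return min(4.6, max(-4.6, log(b + .00001) - log(g + .00001)))

def sum_extremes(scores, k=8):
    """Sum the k largest positive and the k most negative scores (k >= 1)."""
    positives = sorted(s for s in scores if s > 0)
    negatives = sorted(s for s in scores if s < 0)
    return sum(positives[-k:]) + sum(negatives[:k])

def sum_top_absolute(scores, k=15):
    """The alternative rule: sum the k scores with the largest absolute value."""
    return sum(sorted(scores, key=abs)[-k:])

# A made-up message with strong evidence on both sides.
mixed = [4.6] * 15 + [-4.5] * 15
print(sum_top_absolute(mixed))  # about 69.0: only the spammy side is counted
print(sum_extremes(mixed))      # about 0.8: both sides contribute
```

One way to see the difference: when a message has many strong tokens on both sides, the top-fifteen rule can end up dominated by whichever side has slightly larger magnitudes, while the eight-and-eight rule always draws from both.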

This code has no support for serialization and doesn't strip binary attachments, but other than that should work well.
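For what it's worth, serialization could be fairly small, because the scores are a pure function of the counts and can be rebuilt on load. The sketch below is one possible approach, not part of the original code: the function names dump_counts and load_counts and the JSON layout are assumptions, and it works on plain dictionaries shaped like the good and bad counts in the class below.

```python
import json
from math import log

def dump_counts(good, bad):
    """Serialize the two count dictionaries to a JSON string (hypothetical layout)."""
    return json.dumps({"good": good, "bad": bad})

def load_counts(text):
    """Restore the counts and rebuild the scores with the same formula as the class."""
    data = json.loads(text)
    good, bad = data["good"], data["bad"]
    scores = {}
    for token in set(good) | set(bad):
        g = 2 * good.get(token, 0)
        b = bad.get(token, 0)
        if g + b >= 5:
            scores[token] = min(4.6, max(-4.6, log(b + .00001) - log(g + .00001)))
    return good, bad, scores

# Made-up counts, round-tripped through JSON.
saved = dump_counts({"hello": 3}, {"viagra": 6, "hello": 1})
good, bad, scores = load_counts(saved)
```

Storing only the counts keeps the saved data small and avoids the saved scores drifting out of step with the formula if the constants are later changed.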

from math import log
from re import findall

class spamtrap:
    def __init__(self):
        self.good = {}    # token -> number of non-spam messages containing it
        self.bad = {}     # token -> number of spam messages containing it
        self.scores = {}  # token -> clamped log-ratio score (positive = spammy)

    def _recompute(self, token):
        # Non-spam counts are doubled; tokens with too little data get no score.
        g = 2 * self.good.get(token, 0)
        b = self.bad.get(token, 0)
        if g + b >= 5:
            self.scores[token] = min(4.6, max(-4.6, log(b + .00001) - log(g + .00001)))

    def add_spam(self, message):
        for t in _maketokens(message):
            self.bad.setdefault(t, 0)
            self.bad[t] += 1
            self._recompute(t)

    def add_nonspam(self, message):
        for t in _maketokens(message):
            self.good.setdefault(t, 0)
            self.good[t] += 1
            self._recompute(t)

    def is_spam(self, message):
        # Tokens without a score default to a mildly non-spam value.
        ss = [self.scores.get(t, -1.4) for t in _maketokens(message)]
        ssp = [i for i in ss if i > 0]
        ssn = [i for i in ss if i < 0]
        ssp.sort()
        ssn.sort()
        # Sum the eight largest positive and the eight most negative scores.
        total = 0
        for v in ssp[-8:] + ssn[:8]:
            total += v
        return total > 2.2

def _maketokens(message):
    # A dict is used as a set, so each token counts once per message.
    ts = {}
    for t in findall("[a-zA-Z0-9'$]+", message):
        ts[t.lower()] = 1
    return ts.keys()
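To see what the tokenizer does to a message, here is a small standalone sketch. It uses a set comprehension in place of the dict-as-set idiom, and the name make_tokens and the sample message are made up for illustration.

```python
from re import findall

def make_tokens(message):
    """Lowercased, de-duplicated runs of letters, digits, apostrophes and dollar signs."""
    return {t.lower() for t in findall(r"[a-zA-Z0-9'$]+", message)}

tokens = make_tokens("FREE money!!! Don't miss $100 free")
print(sorted(tokens))  # ['$100', "don't", 'free', 'miss', 'money']
```

Note that "FREE" and "free" collapse into one token, punctuation such as "!!!" is dropped, and "$100" and "don't" survive intact because the dollar sign and apostrophe are in the character class.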

Sorry if there are indentation problems on cut and paste - advogato inserts <p> tags even between <pre> tags.
