Paul Graham has some interesting ideas about how to filter spam. Unfortunately he gives code samples in some weird language, so I've translated into Python and fleshed them out a bit.
This code ignores duplicates of a single token in a message, since they resulted in some rather ugly hacks. It also implements Raph's suggestion for using scores. Also, rather than summing the fifteen scores whose absolute values are highest, it sums the eight largest positive and eight most negative, which is much more robust behavior.
This code has no support for serialization and doesn't strip binary attachments, but other than that should work well.
from math import log from re import findallclass spamtrap: def __init__(self): self.good = {} self.bad = {} self.scores = {}
def _recompute(self, token): g = 2 * self.good.get(token, 0) b = self.bad.get(token, 0) if g + b >= 5: self.scores[token] = min(4.6, max(-4.6, log(b + .00001) - log(g + .00001)))
def add_spam(self, message): for t in _maketokens(message): self.bad.setdefault(t, 0) self.bad[t] += 1 self._recompute(t) def add_nonspam(self, message): for t in _maketokens(message): self.good.setdefault(t, 0) self.good[t] += 1 self._recompute(t) def is_spam(self, message): ss = [self.scores.get(t, -1.4) for t in _maketokens(message)] ssp = [i for i in ss if i > 0] ssn = [i for i in ss if i < 0] ssp.sort() ssn.sort() sum = 0 for v in ssp[-8:] + ssn[:8]: sum += v return sum > 2.2
def _maketokens(message): ts = {} for t in findall("[a-zA-Z0-9'$]+", message): ts[t.lower()] = 1 return ts.keys()
Sorry if there are indentation problems on cut and paste - advogato inserts <p> tags even between <pre> tags.