23 Aug 2002 nomis   » (Master)

spam - multi language problems

The Graham/Bayes approach to spam filtering is interesting and seems to work quite well. However, it seems to have a fairly serious problem with mail in multiple languages, and I am not sure how to fix it conveniently.

I get lots of "good" English and German mail, but there is far more English spam than German spam in my inbox. As a result, a word that should appear in nearly every German mail, e.g. "ein", appears rarely in spam and frequently in good mail. Suddenly a word that should be neutral for detecting spam becomes evidence for good mail. In the case of "ein", the spam probability is 0.05 in my database.
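To make the effect concrete, here is a minimal sketch of the per-token probability as described in Graham's "A Plan for Spam" (good counts doubled, tokens seen fewer than five times ignored, result clamped to 0.01..0.99). All counts below are invented for illustration and are not taken from my real database; the function name is my own.

```python
def token_spam_probability(good_count, bad_count, n_good, n_bad):
    """Per-token spam probability in the style of Graham's formula.

    good_count / bad_count: occurrences of the token in good / spam mail.
    n_good / n_bad: number of good / spam mails in the corpus.
    """
    g = 2 * good_count  # good counts are doubled to bias against false positives
    b = bad_count
    if g + b < 5:
        return None  # too rare to say anything about this token
    good_ratio = min(1.0, g / n_good)
    bad_ratio = min(1.0, b / n_bad)
    p = bad_ratio / (good_ratio + bad_ratio)
    return max(0.01, min(0.99, p))


# Hypothetical mixed corpus: 1000 good mails (half of them German) and
# 1000 spam mails (only 50 of them German). "ein" shows up in about 90%
# of the German mails on both sides.
p_mixed = token_spam_probability(good_count=450, bad_count=45,
                                 n_good=1000, n_bad=1000)

# The same word judged against the German mails only.
p_german_only = token_spam_probability(good_count=450, bad_count=45,
                                       n_good=500, n_bad=50)

print(round(p_mixed, 2))        # roughly 0.05
print(round(p_german_only, 2))  # roughly 0.47
```

With the mixed corpus the word looks like strong evidence for good mail; judged against German mail alone it is close to neutral, which is what one would expect from "ein".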

It is not that bad because I do not get much German spam. However, it seems like a fundamental problem to me, and it most probably cannot be addressed without separate databases per language and a way to determine which language a mail is written in (this could most probably work the same way as distinguishing between spam and non-spam). However, the training/sorting work would increase significantly - I usually don't sort my mail by language...
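A rough sketch of how that could look: a tiny word-based naive Bayes guesser picks the language, and the result selects which spam database to use. Everything here is an assumption for illustration (the class name, the "de"/"en" keys, the one-sentence training data, the uniform prior over languages, the add-one smoothing); it is not an existing filter's API.

```python
import math
from collections import Counter


class LanguageGuesser:
    """Naive Bayes over words, trained the same way as a spam filter."""

    def __init__(self):
        self.word_counts = {}    # language -> Counter of words
        self.totals = Counter()  # language -> total number of words seen

    def train(self, language, text):
        words = text.lower().split()
        self.word_counts.setdefault(language, Counter()).update(words)
        self.totals[language] += len(words)

    def guess(self, text):
        words = text.lower().split()
        vocabulary = set()
        for counts in self.word_counts.values():
            vocabulary.update(counts)
        v = len(vocabulary)
        best_language, best_score = None, -math.inf
        for language, counts in self.word_counts.items():
            # Sum of log probabilities with add-one smoothing, so that
            # unseen words do not zero out the whole score.
            score = sum(
                math.log((counts[w] + 1) / (self.totals[language] + v))
                for w in words
            )
            if score > best_score:
                best_language, best_score = language, score
        return best_language


guesser = LanguageGuesser()
guesser.train("de", "das ist ein test und ich habe eine frage")
guesser.train("en", "this is a test and i have a question")

# One (here empty) spam database per language; the guessed language
# decides which one a mail is trained into or scored against.
spam_databases = {"de": {}, "en": {}}

mail = "ich habe ein problem"
language = guesser.guess(mail)
database = spam_databases[language]
print(language)
```

This does not remove the extra work: the language guesser needs its own training mails, sorted by language.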

On the other hand, the very same effect is useful for me with CJK mail - I don't speak any of these languages, so there are no "good" CJK mails in my inbox. It is perfectly reasonable that the filter classifies them as spam...
