11 Apr 2004 Bram   » (Master)

Manual Spam Filters

I noticed the other day that most of the viruses I get have a Message-Id tacked on by the local mailserver. A little bit of messing with procmail and suddenly my junk mail level is under control. I'm now spending a few minutes a day creating new rules throwing out junk mail my old filters missed. I'm sticking with only blocking dead giveaways instead of using complicated scoring rules, and thus far it's working extraordinarily well.

It's too bad that bounces have become effectively useless, since there are so many of them sent in response to forged mail. The only way to fix this would be for the local mail server to record all the Message-Ids it sends out, and only accept bounces which are in response to one of those. Given the quality of current email infrastructure, I'm not holding my breath.

It's amazing what a downer junk mail can be. I'm in much improved spirits now that I'm actually getting all my real mail instead of accidentally deleting it along with the junk.

blinky.at.or.at is sending me huge amounts of junk.

Ebay correction

The method of gaming Ebay I gave earlier isn't nearly as easy as I stated. The problem is that the first out-bid by minimum amount you do might actually win, because you can't tell if the current price is the current second place bid plus the minumum increment or the current first place bid exactly. The attack still works, but to do it you need to do multiple bids increasing by exactly the minumum increment each time, which is a much more flagrant.

Ebay is probably less worried about this than plain old mail fraud. One thing which helps is that auctions are frequently won by snipers, and these tricks don't work when they're bidding. I've taken to sniping everything I really care about, because it saves money.

Trust Metric Correction

The anti-spam trust metric I gave earlier can be easily defeated. A spammer certs a whole lot of fake identities, and from those fake identities, and from those fake identities cert just a few fake identities which they spam from. This will keep the negative certs from propogating back.

Here is a technique which fixes that. For each node, we figure out what the flow of spam looks like viewing that node as the source of everything (how to do this is explained below). If more than half of all such flows label a node as a spammer, then that node is assumed to be a spammer. It's generally impossible for a single node to not be labelled as a spammer in all cases, because when that node is is the source then it gets a very high spam rating, and when any of the nodes which certify it are viewed as the root it also gets a very high spam rating, although not quite so high.

To calculate the flow of spammerness into a single root, calculate flows between nodes along certification lines and though nodes based on the following criteria:

  • Spammerness flows backwards along certification lines, from the node certified to the one which did the certifying.
  • The inflow out outflow of each node sums to zero, except for nodes which have negative certs, which have excess outflow in the amount of their negative certs. The source node is another exception, it can have unlimited excess inflow.
  • The maximum flow through any node is set to be as small as possible. The tiebreak for multiple possible flows with this value the same is that the second greatest maximum flow is as small as possible, then third, etc.

For example, if node A is the source, and A certs B and C, and B and C both cert D, and D has 1 negative cert, then the spammerness ratings for A and D are both 1 and for B and C it's both 1/2.

A threshold is set above which a node is considered to be a spammer. Typically all nodes close to the source wind up looking like spammers and all actual spammer nodes look like spammers.

This technique should of course be combined with distinguishing 'confused' nodes, which are judged as being spammers but are certified by at least one non-spammer node. Confused nodes only get blocked from sending messages if they have directly spammed above the threshold. Without this, very highly connected nodes will sometimes get blocked simply because they're too highly connected, and hence wind being near the source node for all possible sources.

Calculating this exactly might take a while, but there are decent polynomial time approximation algorithms. I'm not sure how small it's possible to get the exponent, but I'm following the policy of designing behavior first, and worrying about efficiency second.

Stamps Against Spam

Ken Birdwell, a coworker of mine, has an interesting idea about how to use stamps against spam. The basic premise of stamps is that a stamp represents some kind of resource, typically a penny, which a sender must have put out in order to get mail accepted. The question is, if the mail wasn't spam, does the penny go back to the sender, get transferred to the receiver, or simply drop into the ether? Ken's observation is that most people send and receive about the same amount of mail, so you could simply have the stamp transfer ownership, without ever turning it back into a penny. This should keep things about at equilibria, and designing a protocol for it is trivial - a client sends a stamp to a server, and the server responds with either a reject or an accept with a new valid stamp.

A few modifications to this scheme can help maintain equilibrium. First, since a lot of mail is sent to dead accounts or at least accounts which don't use this scheme, a sender should try to collect stamps they issued after a month or so to keep them from disappearing into the ether. Second, peers which have an excess of stamps should start probabilistically not caching in (or caching in and sending back) stamps they receive.

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!