Advogato: Extension of Greylisting to kill Spam: Distributed 'Approval

The problem (duh). Lessons from IM. What greylisting does.

The issue: spam. Unsolicited email. Nobody envisaged, when email was created, that there would be such a massive problem. The SMTP protocol, which is overloaded, simply isn't equipped with any design measures to cope with abuse. Its sole and exclusive purpose, at which it is very good, is to deliver messages. That's why it's got the word 'Simple' in its name.

However, very quickly, Instant Messaging became popular, and you got this concept of 'Buddies', right, where... like... you could say, like... that you like... only wanted to hear from people whom you knew were one of your 'Buddies'?

And, like... this isn't fricking obvious that it should be applied to all inter-personal Internet communications? Mailing list software has these kinds of rules built-in. So, what's the hold-up? Why do we still have a spam problem?

Oh - wait: I know what the root cause of the problem is: the monoculture thing again. Again - we come back to that root cause yet again: namely, that Microsoft has set back the innovation of computing by nearly twenty years. Again we come back to the Canceration concept (a Canceration: a corporation whose sole, blatant and exclusive pathological purpose is to make profits above all else, consuming all resources).

Leaving that issue aside for another time, let's at least put the right infrastructure in place; let's give people the option to not have spam, and let's at least get rid of spam from our own back yard. And, given that it's spam-bots (a centrally-controlled, virus-distributed network of Windows PCs that send out spam email using a subset of the SMTP protocol) that cause the problem, let's improve the most effective method (greylisting) to deal with that, and see where we get.

So - what exactly does greylisting do? There's a paper, by Evan Harris, on greylisting, and, from his site:

Greylisting is a new method of blocking significant amounts of spam at the mailserver level, but without resorting to heavyweight statistical analysis or other heuristical (and error-prone) approaches. Consequently, implementations are fairly lightweight, and may even decrease network traffic and processor load on your mailserver.
Greylisting relies on the fact that most spam sources do not behave in the same way as "normal" mail systems. Although it is currently very effective by itself, it will perform best when it is used in conjunction with other forms of spam prevention. For a detailed description of the method, see the Whitepaper.

In that paper, Copyright Mr Harris, 2003-2004, there is a section 'High Level Overview', which I quote from here:

The Greylisting method is very simple. It only looks at three pieces of information (which we will refer to as a "triplet" from now on) about any particular mail delivery attempt:
1. The IP address of the host attempting the delivery
2. The envelope sender address
3. The envelope recipient address
From this, we now have a unique triplet for identifying a mail "relationship". With this data, we simply follow a basic rule, which is:
If we have never seen this triplet before, then refuse this delivery and any others that may come within a certain period of time with a temporary failure.
Since SMTP is considered an unreliable transport, the possibility of temporary failures is built into the core spec (see RFC 821). As such, any well behaved message transfer agent (MTA) should attempt retries if given an appropriate temporary failure code for a delivery attempt.

In other words, legitimate email servers get through, and spam-bots, comprising about 95% of the world's SMTP traffic, don't bother to come back. Here's the issue: world-wide, not that many SMTP servers are running greylisting, and so we're "below the radar". It wouldn't take much effort on the part of the spammers to maintain a bit of state information, and greylisting suddenly becomes almost completely ineffective.

So - at that point, something else needs to be done. And what better time to do that than before the spammers decide that greylisting is worth catering for?

So what's the scoop?

It's actually quite simple. The normal procedure is this: when a message comes in, the greylisting triplet (IP, envelope sender, envelope recipient) is checked to see if it's been heard of before (in the 'approved' queue or in the 'awaiting approval' queue). If it's in the 'awaiting approval' queue, or if it is in neither queue, then "Please try later" is sent to the sender (and the triplet is added to the 'awaiting approval' queue if it wasn't already there). Only once the 'awaiting approval' timeout has been reached, which moves the triplet from the 'awaiting approval' queue to the 'approved' queue, will email messages be accepted.

That's the normal procedure.
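The two-queue procedure described above can be sketched roughly as follows. This is only an illustration, not any real greylisting daemon's code: the class name, the five-minute default and the reply strings are assumptions, the move from 'awaiting approval' to 'approved' is done lazily on the next delivery attempt, and expiry of old entries is left out.

```python
import time
from dataclasses import dataclass, field

DELAY = 5 * 60  # assumed 'awaiting approval' timeout, in seconds


@dataclass
class Greylist:
    """Hypothetical in-memory greylist with the two queues from the text."""
    delay: float = DELAY
    awaiting: dict = field(default_factory=dict)  # triplet -> time first seen
    approved: set = field(default_factory=set)

    def check(self, ip, sender, recipient, now=None):
        """Return an SMTP-style reply for one delivery attempt."""
        now = time.time() if now is None else now
        triplet = (ip, sender, recipient)
        if triplet in self.approved:
            return "250 OK"
        # Add the triplet to 'awaiting approval' if it was not already there.
        first_seen = self.awaiting.setdefault(triplet, now)
        if now - first_seen >= self.delay:
            # Timeout reached: move from 'awaiting approval' to 'approved'.
            del self.awaiting[triplet]
            self.approved.add(triplet)
            return "250 OK"
        return "450 Please try later"
```

A sender that retries after the timeout gets through; one that never comes back simply stays in the 'awaiting approval' queue.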

Where the distributed part comes in is this: an extra step is added by downloading, from a distributed database (similar to pyzor), the number of occurrences of a triplet (or parts thereof) from other people's 'awaiting approval' queues.

There are two parts to the procedure. The first is that whenever your greylist daemon sees a unique triplet that is already in your 'awaiting approval' queue, it immediately reports the triplet to the distributed database (perhaps it would be better to report several all at once - but that's an implementation detail).

The second part of the procedure is, when a triplet is already in the 'awaiting approval' queue, to download a count of the number of times that combinations of the triplet (IP+sender+recipient; IP+sender; IP+recipient; sender+recipient; IP; sender; recipient) have been seen before. All of these counts of the parts that make up an 'awaiting' triplet have very specific - and different - uses, when dealing with spam. For example, count on sender can help identify 419 scammers.
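As a rough sketch of those seven combinations: the function below builds one lookup key per combination, in the order listed above. The key layout and the function names are made up for illustration, and a plain Counter stands in for the distributed database, whose real interface is not specified here.

```python
from collections import Counter
from itertools import combinations

FIELDS = ("ip", "sender", "recipient")


def triplet_keys(ip, sender, recipient):
    """Return seven keys: the full triplet, each pair, then each single part."""
    values = dict(zip(FIELDS, (ip, sender, recipient)))
    keys = []
    for size in (3, 2, 1):
        for names in combinations(FIELDS, size):
            keys.append(tuple((name, values[name]) for name in names))
    return keys


def report(db, ip, sender, recipient):
    """Record one sighting of a triplet (db is a stand-in for the shared database)."""
    for key in triplet_keys(ip, sender, recipient):
        db[key] += 1


def lookup(db, ip, sender, recipient):
    """Fetch the count for every combination of this triplet."""
    return {key: db[key] for key in triplet_keys(ip, sender, recipient)}
```

With this layout, a 419 scammer using one sender address from many machines shows up as a high count under the `(("sender", address),)` key even when every IP and recipient is different.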

A decision can therefore be made to extend the greylisting timeouts from, for example, five minutes to over five hours, based on the number of occurrences of the IP address and/or the sender.
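One possible way to make that decision, purely as a sketch: scale the delay linearly between the two figures according to the worse of the IP and sender counts. The linear scaling, the cut-off of 1000 sightings and the cap at exactly five hours are all assumptions chosen for the example; a real policy could be quite different.

```python
BASE_DELAY = 5 * 60      # five minutes, in seconds
MAX_DELAY = 5 * 60 * 60  # five hours, in seconds
CUTOFF = 1000            # hypothetical count at which the maximum delay applies


def greylist_delay(ip_count, sender_count, cutoff=CUTOFF):
    """Return a greylisting delay in seconds, based on the worse of two counts."""
    worst = max(ip_count, sender_count)
    fraction = min(worst / cutoff, 1.0)  # 0.0 for unknown senders, 1.0 at the cut-off
    return int(BASE_DELAY + fraction * (MAX_DELAY - BASE_DELAY))
```

An IP address or sender that nobody else has reported keeps the ordinary five-minute delay, so normal mail is treated exactly as plain greylisting would treat it.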

It's actually very simple. All that we are doing is providing the same communication rules that Instant Messaging has had, for nearly forever.

The difference is: SMTP is global. So, we have to deal with the problem. Globally.

What's the catch?

Well, if anyone can think of one - I'd obviously like to know. I can think of one that sounds like a problem, and it's based on me staring at SMTP traffic coming from bots, for several years. Many Spam-Bots send their messages to random (invented) names at your domain, and many of them send their messages to well-known (or well-used) names, such as "postmaster", "webmaster", "administrator" etc. It's the random name-delivering that I'd like to focus on, for a bit.

I don't honestly know what this random delivering is for - but I can hazard a guess. Perhaps it was the brainchild of someone thinking that if there are enough monkeys (remember, they have control of perhaps hundreds of thousands of computers) that generate enough email addresses at a particular domain, then they will at least hit a small percentage. What they are forgetting of course is that this only works against ISPs like Gmail, Yahoo, Hotmail etc. In other words, what the monkeys, I mean the spammers, are forgetting is that the bottle-neck isn't the number of computers that you're using to distribute spam, it's the number of valid recipients on the end of your domain.

In other words, if the spammer is generating random recipients, they're hardly likely to come back and try them again. But if they do - the distributed 'awaiting approval' greylist idea is waiting to take them on.

So, that leaves emails that are being sent from those viruses that copy your contacts list, and forge up an email pretending to be from one of your friends. My favourite variation on this theme is the ones that send a virus pretending to be from one of your friends to one of your friends. To be honest: if spambots start doing this kind of attack via a better SMTP service that thwarts current greylisting, then, unless the contacts list is particularly long (stolen credit-card or bank account Nationwide Building Society fined $1m long) then it's really not relevant, and anti-virus and other spam analysis techniques can catch it.

Of particular concern is the one that sends from a random entry in a person's contact list, to a random entry, taking random text from documents on the person's hard drive, attaching an image that contains a buffer-overflow with an embedded virus. The random text is there to defeat Bayesian analysis. The virus is there to take over more monoculture machines.

To be absolutely and brutally honest about this kind of attack: I couldn't care less. If people want to be stupid enough to believe the hype about Microsoft - that it is their only choice - then they deserve everything that they get.

But - at least, it might be possible to help such poor people, by detecting that their email address was coming up 'red' on many machines, and doing graph analysis on several sender-recipient tuples.

So, this leaves us with at least being able to detect 419 scammers, and the machines from which they are sending out their scams.

You know - the people who register an email address with the sole purpose of harvesting responses so that they can invite them to have their money stolen. Oh, man, are they going to be pissed when they find that, world-wide, after the first thousand or so attempted messages (remember - greylisting is done at MTA time!), that they can only send out about one email every five hours.

What else? Oh yes: false positives.

Possible false positives

Recall that I said that the IP address should be noted when it disobeys the 'please try later' rules? Well, there are servers out there that disobey the RFCs. After very little thought, I don't believe that the extensions to greylisting that I propose make any difference, other than the fact that such servers would be detected much quicker.

I make this conclusion based on the fact that greylisting causes problems for such stupid servers anyway, and if you really want to hear from them, you have to specifically add them into the whitelist of your greylist daemon's configuration file.

Actually, it does make a difference: if enhanced greylisting became popular, then those servers that were disobeying the SMTP RFCs on 45x responses would stick out much, much quicker.

I do know of some dickheads who thought that it would be sensible to attempt to deliver email once per minute for an hour, then give up for eight hours, and try again, then give up for a further twelve hours, and try again. This of course proved to be completely ineffective when greylisting was in use, because the 'approved' triplets are only stored for eight hours, and are discarded if they're not used in time!

So my poor client, expecting to receive a critical email at 5:30pm, actually received it at 11am on Sunday, when everybody had gone home for the weekend. The litany of the analysis I wrote to the client read like one of those Darwin Awards, but the only people to whom it would be funny would be other unix sysadmins - of that I am certain.

It certainly wasn't funny to the client.

What else is there. Mailing list software? Well, I'm assuming that good mailing list software is run by competent sysadmins, who have an SMTP server that actually obeys the SMTP specifications. Under these circumstances, their email, in the thousands, would never trigger the multiple-disobedience radar.

If they did, then people would need to whitelist them - and, as I mentioned above, that's very common for people who use greylisting anyway.

Another idea to extend the usefulness of greylisting

Perhaps one of the most useful ideas which could be incorporated into the greylisting daemon with these extensions is to utilise one of the techniques that Internet Security Systems' proprietary RealSecure (tm), and free software intrusion detection software, use: notifications.

The idea is that if a particular "event" occurs often or very frequently, then you send a notification event to the sysadmin. It would be incredibly useful to have a notification command which can be run if the 'count' of input from a particular IP address reaches an intermediate threshold before reaching the 'cut-off' threshold at which the sender's SMTP server is put onto the 5-hour-queue.
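A minimal sketch of such a two-threshold check, under the assumption that the daemon keeps a running count per IP address. The function name, its parameters and the way the notification command is invoked (with the IP address and count appended as arguments) are all invented for illustration.

```python
import subprocess


def check_thresholds(ip, count, notify_at, cutoff_at, notified, notify_cmd=None):
    """Return 'cutoff', 'notify' or 'ok' for the current count of one IP address.

    notified is a set of IP addresses that have already been reported, so the
    sysadmin gets one notification per address rather than one per message.
    """
    if count >= cutoff_at:
        return "cutoff"  # the caller puts this sender onto the 5-hour queue
    if count >= notify_at and ip not in notified:
        notified.add(ip)
        if notify_cmd:
            # Run the configured command, e.g. a script that mails the sysadmin.
            subprocess.run([*notify_cmd, ip, str(count)], check=False)
        return "notify"
    return "ok"
```

The gap between the two thresholds is what gives the sysadmin time to whitelist a legitimate but badly-behaved server before it lands on the long queue.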

Conclusion

Greylisting, which is performed at MTA time (not after the message has been delivered, but the very first thing) is very effective against spam-bots that do not obey the SMTP RFCs on 450 "Please Try Later" responses. If, however, our spam bunnies get a brain between them that they can actually get to work in its jar, then greylisting unfortunately becomes much less effective.

We have shown, above, however, that there is a way to extend the concept of greylisting to get a much more rapid response, by utilising techniques similar to those that pyzor uses - but just on the IP address, sender address and recipient address, rather than the entire message and/or its headers.

We also, unfortunately, have demonstrated another instance where the monoculture of Windows is, like bacteria and yeast, producing so much toxic material that it's killing its own environment. The question is: will people learn? (yep - looks like it).

Special thanks to Phil Hands for the random discussions and his great ideas, without which this article would not have happened.


Posted 25 Feb 2007 at 23:34 UTC by lkcl

SPF is "anti-fraud", not "anti-spam", posted 26 Feb 2007 at 15:01 UTC by Pizza » (Master)