Older blog entries for FarcePest (starting at number 2)

You now have the opportunity to join the most extraordinary and most powerful wealth building program in the world!

Typical morning:

I am the comstar.net Spam Disposal Unit.

My day usually starts with a new pot of coffee. Unfortunately it was left on all weekend, so I need to wait for it to cool down. I really need to get one that shuts off on its own.

After logging in, I check my inbox, and then switch to the relays folder. 64 new relays. For a Monday, that's fairly typical.

comstar.net is a business ISP, which among other things means we don't have dialup services (excepting ISDN). Some of our customers are dialup ISPs, though. But anyway, part of the business is mail hosting for a couple hundred different domains. Which means, we get spam. Most of it seems to be aimed at comstar.com, which is one of ours, but there are hardly any users within that domain. However, some spammer, somewhere along the line, got the idea that this domain has a million users in it. Part of the problem here is that we use qmail for the MTA, and qmail-smtpd doesn't check recipients during the SMTP session, except to make sure the message is for a domain it should accept. So from the sender's perspective, all those recipients seem to exist.

So the general scenario goes like this:

  • Spammer sends to some 10K pseudo-random addresses, probably generated from some list of common names.
  • We attempt delivery on them.
  • Nearly all of them bounce.
  • The envelope sender is fake, of course, so they bounce again.
  • The double-bounces go into my spamgrab script.
This is where the fun begins.


The spamgrab script (mostly procmail) starts out by getting the original bounced message. With the qmail bounce format, this is pretty easy. sed does the job nicely.

Next it finds the IP of the host we got the message from, and it compares this against a cache. The cache entries stay around for a week, but after 24 hours they expire. That sounds a little contradictory. There are really two tests against the cache: The first test checks to see if the IP is present (up to a week). If it's there, it's a known spam host. The second test checks to see if a report has been sent within the last 24 hours. Messages from hosts that are in the cache are sent to /dev/null after any reporting.
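The two-window cache check described above can be sketched like this. This is a minimal in-memory model; the real script keeps its cache on disk, and the function names here are my own:

```python
import time

WEEK = 7 * 24 * 3600
DAY = 24 * 3600

# cache maps IP -> (first_seen, last_reported), both Unix timestamps.
cache = {}

def check_host(ip, now=None):
    """Return (is_known_spam_host, should_report) for this IP."""
    now = time.time() if now is None else now
    entry = cache.get(ip)
    if entry is not None and now - entry[0] > WEEK:
        del cache[ip]            # week-old entries fall out entirely
        entry = None
    if entry is None:
        return (False, True)     # unknown host: not cached, but reportable
    first_seen, last_reported = entry
    # Known spam host; only report again if 24 hours have passed.
    return (True, now - last_reported > DAY)

def record_report(ip, now=None):
    """Note that a report was just sent for this IP."""
    now = time.time() if now is None else now
    first_seen = cache[ip][0] if ip in cache else now
    cache[ip] = (first_seen, now)
```

So a host stays "known" for a week, but generates at most one report per day during that week.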

The host might not be in there at all, of course. However, we also employ an RBLCheck script that runs just before qmail-smtpd. This checks against ORBS, RSS, RBL, and DUL, and tags the headers to indicate which lists that host is on. It does some other fun things as well; more on that later.

The tagging is for the benefit of the spamgrab script, when the mail eventually double-bounces. The spamgrab script looks for these tags, and generates a report if they are there, under certain conditions.

NEW Stock Holders and Investors Alert - for April 7

I haven't said much about the reports yet. Due to the cache, reports are only sent for a given host once every 24 hours.

If the host is on DUL, it sends a spam complaint (original message only) to the host's ISP, using the abuse.net database.

Otherwise, it's assumed to be relay spam. This generates a detailed relay spam report (including the entire original double-bounce) to the ISP's abuse department. It also generates a relay report, saving it in my relays mailbox. The relay report is for ORBS and/or RSS, avoiding reporting to lists that it is already on. Later on I go through and inspect these, and pump them back through the script so that they are actually mailed out.
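The report routing boils down to a simple decision on the RBL tags. A sketch, with tag names and report labels that are my own shorthand rather than anything from the actual script:

```python
def plan_reports(lists_tagged):
    """Given the set of lists the sending host was tagged with,
    return which reports to generate."""
    if "DUL" in lists_tagged:
        # Dialup host: just a spam complaint to the ISP,
        # original message only, routed via the abuse.net database.
        return ["isp-complaint"]
    # Otherwise assume relay spam: a detailed report (with the full
    # double-bounce) to the abuse department, plus relay reports for
    # whichever of ORBS/RSS don't already list the host.
    reports = ["abuse-relay-report"]
    for rbl in ("ORBS", "RSS"):
        if rbl not in lists_tagged:
            reports.append("relay-report-" + rbl)
    return reports
```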

One time, a spammer sent us the same spam through at least 300 different relays, twice on the same weekend. (600 total.) But since most of these were listed on ORBS, the spamgrab script sucked them all up, reported the relays to their ISPs, and generated reports for RSS. On average, though, I only generate about 3000 relay reports a month. Most of those are unique.

Hello Natural Health Enthusiast,

Now I know what you're thinking: If I'm using ORBS, RSS, RBL, and DUL, why do I have any spam to bounce?

Answer: Because I have leaky spam filters, by design.

  • If the host is on ORBS and either RSS or RBL, we refuse the mail at the SMTP session.
  • If the host is on DUL, it's throttled: Additional recipients after the first get a temporary failure code. In the Battle of the Bandwidth, DS3 beats V.90 any day.
  • If it's just on RSS, it's temporarily failed about 90% of the time. The other 10% of the time, it gets through.
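The three rules above amount to a per-recipient verdict at the SMTP session. A sketch of the policy, assuming the real check happens in the RBLCheck wrapper before qmail-smtpd:

```python
import random

def smtp_verdict(lists, rcpt_count, rng=random.random):
    """Return the action for a RCPT TO, given which block lists
    the connecting host is on and how many recipients so far."""
    if "ORBS" in lists and ("RSS" in lists or "RBL" in lists):
        return "refuse"        # permanent failure at the SMTP session
    if "DUL" in lists and rcpt_count > 1:
        return "tempfail"      # throttle: temp-fail rcpts after the first
    if lists == {"RSS"} and rng() < 0.9:
        return "tempfail"      # RSS-only: fail ~90%, leak ~10%
    return "accept"
```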

The leaky filters are what enable me to send so many relay reports. If I blocked on ORBS directly, I wouldn't have spam to send to RSS. Besides, a lot of ORBS hosts aren't yet abused by spammers; remember that I only bother with the double-bounced spams. But once they are on both, I don't need or want 'em. RBL I just don't trust that much; their policies seem too erratic. But I will block on RBL if there's an ORBS listing. ORBS at least has an objective criterion: Does the host relay, or is it the smarthost for another relay? RSS is a little different: Does the host relay, and has it relayed spam? I never liked the idea of blocking all dialup connections. It's a bit unfair to Linux users who can actually run a real MTA.

But the leaky filters are just the beginning. I log all these incoming connections, and there's another script that finds the worst offenders for the most recent period. Those hosts, the ones that are connecting the most and are spam-listed, get put on the firewall for a while.

On a typical weekday, we refuse something like 70% of the incoming connections. On weekends, this goes up to about 95%.

Home Improvement Loans Here

What spam does get through, and past the spamgrab script, goes in my spam box. I sort these by size, look for clusters, pick a likely candidate, select a unique string ("waste your time", "university diplomas", "international driver's license"), and then pump those back into the script with an option that tells it: This is relay spam. This forces it to generate reports.

We (I) would be completely swamped without all this, and it's evolved over time to the point where it's gotten pretty efficient. It would be tough to do this with sendmail. qmail's modular design makes it relatively easy.

And I haven't told you about smeat yet... :)

gstein certified me. I ended up browsing his Python pages and found a cool way to make Python byte-code files (.pyc) executable. Then I thought of a cooler way to do it. Or at least, if you want to make your .pyc files executable, this is an easier way to go about it.

P.S. 10-Apr-2000: It turns out that it only works if you directly reference the Python interpreter on the #! line, i.e. my examples with /usr/bin/env python don't actually work. I've since fixed the page. Thanks again to argent for some discussions that led up to this discovery.

Released MySQLdb-0.2.0. A lot of changes in this version, and most of them are non-obvious. Which is a good thing, right?

There's now a mutex in the standard Cursor class, which allows two threads to share a connection. Personally, I think sharing connections is a bad idea. For one, each connection is a separate thread in mysqld, so by sharing a connection, you don't get to take advantage of multiple CPUs, or to do something while another operation blocks. Second, transactions are coming to MySQL, or so I hear, sometime in 3.23. It seems likely that transactions will begin and end on a per-connection basis; in most database designs, the commit/rollback is done on the connection, not the cursor.
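The idea behind the mutex is simple: serialize all operations on the shared connection. A toy model, not the real MySQLdb code (the FakeConnection stand-in is mine):

```python
import threading

class FakeConnection:
    """Stand-in for a MySQL connection; just records the queries it runs."""
    def __init__(self):
        self.log = []
    def run(self, query):
        self.log.append(query)
        return len(self.log)

class SharedCursor:
    """Toy model of the mutex in the standard Cursor class: the lock
    serializes operations so two threads can share one connection
    (the server still sees a single session, hence the objections above)."""
    def __init__(self, conn):
        self._conn = conn
        self._lock = threading.Lock()
    def execute(self, query):
        with self._lock:          # only one thread talks to the server at a time
            return self._conn.run(query)
```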

There are now multiple cursor classes, built using several MixIn classes. I won't list them here; they are in the documentation. But it's something like this:

  • return rows as tuples or dictionaries
  • use client-side (mysql_store_result) or server-side (mysql_use_result) cursors (the latter you especially don't want to share connections with)
  • raise Warning or not

Or you can mix up your own class.

0.1.3 introduced some methods that were intended to create a little backwards compatibility with the older MySQLmodule: cursor.fetchXXXDict(). It didn't quite work out that way. MySQLmodule would set the keys to be "table.column", and my implementation used "column". I figure, if your columns aren't unique, you should just alias them with AS in your SQL. The people who wanted the old way of doing things seemed to use SELECT * a lot, which is usually a bad idea.

So then I got the bright idea: Set the key to "column" unless it already exists; otherwise set it to "table.column". This gets rid of most of the ambiguity problems.
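The key-naming rule is easy to state in code. A sketch of the idea, assuming the column metadata arrives as (table, column) pairs (the function name and input shape are mine):

```python
def row_to_dict(columns, row):
    """Build a result dictionary: plain "column" keys, falling back
    to "table.column" only when the plain key is already taken."""
    d = {}
    for (table, column), value in zip(columns, row):
        key = column if column not in d else "%s.%s" % (table, column)
        d[key] = value
    return d
```

So a join that returns two id columns still produces unambiguous keys, and the common single-table case keeps the short names.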

The other thing I did which relates to backwards-compatibility is I got rid of nearly all of the keyword options on the db.cursor() method, namely use and dict. There is a new one, cursorclass, which you can set to your own cursorclass. If you are a freak for fetching dictionaries, you would now do db.cursor(cursorclass=DictCursor), and this gives you a cursor that returns dictionaries. cursor.fetchXXXDict() is deprecated; just use cursor.fetchXXX() on the DictCursor instance.

Breaking the Cursor class into a bunch of MixIns let me do another important optimization. By default, the old cursor class used client-side cursors (mysql_store_result). This means your entire result set is sucked into memory on the client side. Then the various fetch calls would return rows. When you deleted the cursor, only then would the MYSQL_RES be freed.

What it does now is: After it does the query, it fetches all the rows, and hides them in the cursor. Then it frees the MYSQL_RES. Theoretically, this cuts the memory utilization down. Queries are probably slower (for large result sets), but fetches now are literally just slices.

This also allowed the introduction of some non-standard methods: cursor.seek(offset[, whence=0]) and cursor.tell().
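The store-and-slice strategy, plus seek()/tell(), can be sketched like this. This is a simplified model of the behavior, not the actual module; in the real thing, the rows come from mysql_store_result and the MYSQL_RES is freed once they're drained:

```python
class StoredResultCursor:
    """Fetch every row up front, then serve fetches as list slices."""
    def __init__(self, rows):
        self._rows = list(rows)   # drained from the server, result freed
        self._pos = 0
    def fetchone(self):
        if self._pos >= len(self._rows):
            return None
        row = self._rows[self._pos]
        self._pos += 1
        return row
    def fetchmany(self, size):
        rows = self._rows[self._pos:self._pos + size]
        self._pos += len(rows)
        return rows
    def seek(self, offset, whence=0):
        # whence=0: absolute, 1: relative, 2: from the end (file-style)
        if whence == 0:
            self._pos = offset
        elif whence == 1:
            self._pos += offset
        else:
            self._pos = len(self._rows) + offset
    def tell(self):
        return self._pos
```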

Another optimization: A new string literal function in _mysql that not only does mysql_escape_string but adds the necessary quotes in place. This should speed up string INSERTs, since fewer strings are being built up and torn down.
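In Python terms, the effect is something like the sketch below. The escaping here is a simplified stand-in for mysql_escape_string (which is C and handles more cases); the point is that quoting happens in the same pass:

```python
def string_literal(s):
    """Escape a string and wrap it in quotes in one step, so building
    an INSERT doesn't create and discard intermediate strings."""
    escaped = (s.replace("\\", "\\\\")
                .replace("'", "\\'")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r")
                .replace("\0", "\\0"))
    return "'" + escaped + "'"
```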

Because of the work to make connections sharable, you now have to call cursor.insert_id() instead of db.insert_id() to get the last inserted auto increment field value.

I wonder sometimes whether the MixIn classes for Cursors are really worth it. They might cost performance, since you need the BaseCursor class and three MixIns to implement the standard cursor. On the other hand, some program logic is removed, which is usually a good thing.

On the Zope front: There's a ZMySQLDA-1.1.4 out. The patch I have to make ZMySQLDA use MySQLdb/_mysql doesn't work on it. However, I do have a new version (cobbled together from ZOracleDA) that I haven't released yet. If you want to test it, let me know. It seems to work perfectly fine for SELECT; someone has told me it doesn't work for other things. I'm using it in an on-going project; I just haven't had to write to the database yet.

MySQLdb is the number one download at The Vaults of Parnassus, beating out stuff like Numerical Python and Zope. Wow. I guess a lot of people like MySQL. So due to demand, I've started to build Red Hat RPMs as well. The package is named MySQL-python so that it's a sub-package of MySQL.
