The case against crawler918.com

Posted 13 Jan 2003 at 01:22 UTC by mbp

I saw some hits on my web site the other day from machines in crawler918.com. Always curious about new developments in web searching, I thought I'd find out about it. It's not a happy story.

Here's an example log line:

crawler1.crawler918.com - - [12/Jan/2003:23:46:38 +0000] "GET /problems.html HTTP/1.1" 200 6428 "-" "Mozilla/4.7" distcc.samba.org

To start with, it's obviously lying about being simply "Mozilla/4.7". Of course starting off the description with "Mozilla" to indicate capability is pretty standard these days. But UAs ought to say what they really are, and RFC 2616 says that robots should give a URL to find out about them. For example:

crawler11.googlebot.com - - [12/Jan/2003:23:15:00 +0000] "GET /manual/html/distcc-4.html HTTP/1.0" 200 2904 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)" distcc.samba.org

Well, perhaps there's some information at crawler918.com? No, not a sausage.

Curious. Who's this robot poking around so anonymously? Nothing wrong with anonymity on the net of course, but it piqued my curiosity.

A quick grep showed that the robot was violating the robots exclusion standard: it never requested robots.txt. Very naughty.
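
For reference, the convention being flouted is trivial: before crawling anything else, a polite robot fetches /robots.txt from the site root and skips whatever it disallows. A minimal sketch of such a file (the "SomeBot" name and the /private/ path are only illustrations; since this crawler doesn't even identify itself, the wildcard entry is the one it ought to be honouring):

    # robots.txt -- fetched from the site root by well-behaved crawlers
    # Shut one robot out entirely, by name ("SomeBot" is a placeholder)
    User-agent: SomeBot
    Disallow: /

    # Everyone else: please stay out of this one directory
    User-agent: *
    Disallow: /private/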

Google and WHOIS show that crawler918.com is in fact owned by nameprotect.com:

NameProtect offers a comprehensive suite of research, watching and online brand monitoring services that assist brand professionals, attorneys, and other Intellectual Property specialists in building, protecting and managing their brands in the digital world.

No thank you.

It looks like they're running this robot that sneaks in where it's not wanted, with the aim of later bringing lawsuits against web site hosts. I can't see any way in which allowing access by this robot could ever be of benefit to the web host, and its behaviour is dishonest and distasteful.

I'd be curious to see a criminal test case on the grounds of "unauthorized access to a computer resource" brought against people like this who bypass robots.txt, that being an access control mechanism. In Australia the penalty is up to ten years in jail.

More to the point, it's easily blocked. I encourage you to do this, at least as a gesture, even though they'll probably get new IPs in the future.

    Deny from 12.148.196.128/25
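
(That line assumes it lands inside an existing access-control section. For anyone pasting it cold, a self-contained .htaccess sketch, assuming Apache 1.3 or 2.0 with mod_access, would look something like this:)

    # Allow everyone except NameProtect's crawler addresses
    Order Allow,Deny
    Allow from all
    Deny from 12.148.196.128/25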

I'm not generally in favour of reducing connectivity, but I don't see any reason to invite hostile ambulance-chasers into my home.

"I never undertake to instruct my enemies" -- Rockefeller


Thanks, posted 13 Jan 2003 at 01:33 UTC by djm » (Master)

# tail -1 /etc/pf.conf
block in log from 12.148.196.128/25 to any

Won't be worrying about them.

blacklist, posted 13 Jan 2003 at 12:23 UTC by tma » (Observer)

we definitely need a blacklist for crawlers too, not just SPAMmers.

P2P System of Distributed Block Lists, posted 13 Jan 2003 at 18:11 UTC by hacker » (Master)

What we need is a way to share these lists among our peers, and have aggregate validation of the hosts stored in these lists, similar to RBL (but for blocked hosts and on a distributed basis), for things like spammers, Microsoft's virus-o-the-week (Nimda, CodeRed, et al.), people banging down websites with aberrant spiders and crawlers (crawler918, crawlXX-public.alexa.com, etc.), and other similar abuses.

Having a proper way to add and remove them from the "master list" which is distributed to every person who wishes to get it would be a primary goal of course. Something like distcc (hint hint) for distributing these block lists.

I probably have about 3,000 spammers listed in my iptables lists, plus abusive domains, spiders, crawlers, and clients who are knowingly or unknowingly infected with one Microsoft virus/trojan or another.
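
The consuming side of such a scheme is the easy half. A rough sketch, assuming the shared list is plain text with one CIDR block per line published at a made-up URL (the aggregate validation and signed add/remove protocol is the part that actually needs designing):

    #!/bin/sh
    # Fetch a shared blocklist (one CIDR block per line, '#' for comments)
    # and load it into a dedicated iptables chain.  The URL is hypothetical.
    LIST_URL="http://blocklist.example.org/crawlers.txt"

    iptables -N BLOCKLIST 2>/dev/null   # create the chain if it doesn't exist yet
    iptables -F BLOCKLIST               # flush whatever the last run loaded

    wget -q -O - "$LIST_URL" | grep -v '^#' | while read net; do
        [ -n "$net" ] && iptables -A BLOCKLIST -s "$net" -j DROP
    done

    # Send incoming traffic through the chain (delete first so it isn't added twice)
    iptables -D INPUT -j BLOCKLIST 2>/dev/null
    iptables -I INPUT -j BLOCKLIST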

Another nice feature would be a way to automagically report these abuses upstream as they happen, similar to the way EarlyBird does it for worms.

I've been thinking about this for a while, but I just don't have the time right now to code up a working skeleton prototype. Anyone else?

robots.txt is advisory only, posted 13 Jan 2003 at 23:33 UTC by dtucker » (Master)

I don't think you can describe robots.txt as an access control mechanism; it's advisory only. (Any mechanism that relies on the client to do the access control can be at best advisory.)

That said, I don't think what they're doing is right, and they've become the first entry in my .htaccess file.

advisory access control, posted 14 Jan 2003 at 03:56 UTC by mbp » (Master)

So "advisory access control" is a funny thing in computer security, because computers are essentially always only doing what their programmers tell them. It's not possible to remotely break into a computer in the way one can force the lock on a car or house. "Cracking" is just sending requests which the machine obeys, even though the owner didn't intend it to.

If a web page asks a Win95 machine to replace its kernel with an image of the goatse guy, is the web page really at fault? It only made the request. If the Windows machine followed it, that's between the owner and Microsoft -- or so some people would say.

If a web site's access policy, or "click-through licence", says that robots can only access the site in accordance with the robots.txt file, then that would seem to be as reasonable as any other click-through licence. I think the law is such that the operator just needs to make it reasonably clear that the access is not allowed, rather than making it technically impossible.

Obviously I don't really care enough to bring a lawsuit, but it's interesting to speculate.

re: blacklist, posted 20 Jan 2003 at 11:14 UTC by Alquimista » (Apprentice)

I used to frequent www.webmasterworld.com; that site has a forum dedicated to finding out about web spiders, and they have a very thorough list of evil bots.

Other means, posted 27 Jan 2003 at 11:23 UTC by RoUS » (Master)

That's not the correct Deny range. Look at <URL:http://ws.arin.net/cgi-bin/whois.pl?queryinput=!%20NET-12-148-209-192-1>; the correct range is CIDR 26, not CIDR 25. By using 25 you're blocking more than just these bozos. So:

Deny from 12.148.209.192/26
 

In addition, when I first saw them hitting my site, their User-Agent was NPBot-1/2.0, so add

SetEnvIfNoCase User-Agent "NPBot" evil=1
Deny from env=evil
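
(For completeness, the address range and the User-Agent match combine into one block; again just a sketch, assuming Apache 1.3/2.0 with mod_access and mod_setenvif:)

    SetEnvIfNoCase User-Agent "NPBot" evil=1
    Order Allow,Deny
    Allow from all
    Deny from 12.148.209.192/26
    Deny from env=evil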
 

A new use for spambait, posted 2 Feb 2003 at 15:33 UTC by mca » (Journeyer)

There are technical means for making crawler918 see the error of its ways. They must have some means of detecting infinite loops, and moving on. The spambots I've seen tend to grab a handful of pages and then move on regardless. Effectively, they are timeslicing between many indefinitely large tasks.

Putting up some equivalent of wpoison (non-free) or tramspap (GPL, has self-DoS problems; I'm plugging my own code, sorry) might have the same futile but satisfying effect on 918 as it has on the spammers.
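
The idea behind those tools is simple enough to sketch: a page that does nothing but link to more pages like itself, so a crawler that ignores robots.txt wanders in circles. A toy version as a shell CGI (the trap.cgi name is made up; the real tools add throttling and plausible-looking fake text, which is where the self-DoS handling comes in):

    #!/bin/sh
    # trap.cgi -- toy spider trap: every page links to ten more like it.
    echo "Content-Type: text/html"
    echo ""
    echo "<html><body><p>Nothing to see here.</p>"
    i=0
    while [ $i -lt 10 ]; do
        echo "<a href=\"trap.cgi?p=$$-$i\">more</a>"
        i=`expr $i + 1`
    done
    echo "</body></html>"

You would also Disallow the trap's URL in robots.txt, so only robots that ignore the file ever wander into it.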

There is a (very long) Debian ITP against these two; I joined the fray near the bottom, and a couple of people have shown interest in the last few months.

If I found or became a hardcore Apache module h4x0r then I could probably fix the self-DoS problem in tramspap... there are probably simpler solutions too (first CGI up becomes the daemon which monitors for self-DoS and then calls the game off).

(Trust me to post after the article falls off the front page)

better, posted 3 Mar 2003 at 03:18 UTC by mbp » (Master)

(For completeness, although this article is pretty old.)

It looks like NameProtect has cleaned up their act in three important ways: they include a URL with further information, they (claim to) respect robots.txt, and they use a proper agent string.

I wish they'd done the right thing in the first place, but better late than never.

There's some more information about it here.

You might still question whether offering service to NPBot is ever in a webmaster's best interests, but at least now they're playing within the rules.
