Spam is a serious problem in many Wiki communities, often forcing them
to restrict editing rights, at least temporarily. Most recent attempts
to find a solution focus on captchas and spam lists. Captchas may be
effective to some extent; the problem is that to make them unreadable
for bots, they must be distorted so much that they also become difficult
for humans to read. Lists seem less and less effective, often
accumulating thousands of entries and still leaving enough gaps for
spammers. Spammers frequently use the Wiki search box to check whether
there is already some spam on the site: existing spam shows that the
Wiki may be poorly maintained and that they can add more. Hence it may
make sense to implement delayed indexing, but this also delays the
indexing of legitimate content. Blocking IP addresses is also no longer
useful due to DHCP.
One solution may be to use combined protection rather than relying on a
single "killer" approach. The rationale is to make the spammer invest
more and more work into building the spam bot. Requiring a complex bot
does not make an attack impossible, but it may statistically eliminate
a significant percentage of spammers who are not willing to invest
enough resources.
While maintaining our site (ultrastudio.org), we observed that a
significant percentage of spam can also be stopped by relatively simple
means that, to our surprise, were missing in the JAMWiki 0.8.4 we use
(before we added them), so they may be missing in many other Wiki
engines as well. If you work with the source code, the following
extensions can be added to basically any Wiki engine that accepts edits
through a web form:
1. When processing the edit form, check the request type and require
POST (a bot that uses GET is much easier to implement). This may look
funny, but there really are some wandering bots that periodically try
to post spam links as new pages using GET requests.
2. The edit session is always a three-page session: the user visits the
viewing page, then gets the edit page by following the edit link, and
then submits the edit page. Tie these three pages together through
cookies or other obvious means. Again, a bot that needs to send one
request, understand the response and submit another request including
data from the previous reply is more complex to write.
3. Set a minimum duration for the edit session, especially if multiple
edits follow in rapid succession. A human will need at least a few
seconds for the edit and about the same time to start another edit. A
bot frequently tries to edit a different page every quarter of a second,
making it possible to discover and block it automatically.
4. Check the order of the fields and the overall structure of the HTTP
header and verify whether the browser identified as the user agent is
likely to produce such a request. Reject edit calls of clearly
non-browser origin. This forces the spammer to abandon the simple web
access functions present in the standard libraries of many languages.
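
Checks 1 to 3 above can be sketched in a few lines. The following is
only a minimal illustration in Python, not the code we added to JAMWiki:
the class name EditGuard, its method names and the 3 second threshold
are all assumptions made for the example. The header check from point 4
is left out because it depends on per-browser data. A real
implementation would also expire old tokens and keep them per user
session rather than in one in-memory object.

```python
import secrets

MIN_EDIT_SECONDS = 3.0  # assumed threshold, tune it for your own site


class EditGuard:
    """Hypothetical helper that ties view -> edit form -> submit together."""

    def __init__(self, min_edit_seconds=MIN_EDIT_SECONDS):
        self.min_edit_seconds = min_edit_seconds
        self._view_tokens = set()  # tokens handed out with a viewing page
        self._edit_tokens = {}     # edit token -> time the form was served

    def on_view(self):
        """Call when a page is viewed; send the token back, e.g. as a cookie."""
        token = secrets.token_hex(16)
        self._view_tokens.add(token)
        return token

    def on_edit_form(self, view_token, now):
        """Call when the edit form is requested.

        Returns a new edit token, or None if no page view preceded the request.
        """
        if view_token not in self._view_tokens:
            return None
        token = secrets.token_hex(16)
        self._edit_tokens[token] = now
        return token

    def on_submit(self, method, edit_token, now):
        """Return (accepted, reason) for a submitted edit."""
        # Check 1: the edit must arrive as a POST request.
        if method.upper() != "POST":
            return False, "not a POST request"
        # Check 2: the edit form must have been served first (token is single use).
        served_at = self._edit_tokens.pop(edit_token, None)
        if served_at is None:
            return False, "edit form was never requested"
        # Check 3: a human needs at least a few seconds to make an edit.
        if now - served_at < self.min_edit_seconds:
            return False, "edit submitted too quickly"
        return True, "ok"
```

In a web handler, on_view() would be called when serving the viewing
page, on_edit_form() when serving the edit form, and on_submit() with
the request method and the current time when the form comes back.
Repeated "too quickly" rejections from one client are a reasonable
signal for blocking it automatically.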
Protection of this kind only eliminates relatively simple bots: it is
surely possible to write a bot that would bypass it. However, in my
experience, simple bots make up a significant percentage of all bots,
and stopping them saves a lot of resources. At least in my case, the
amount of worry dropped by an order of magnitude, freeing a lot of time
to work on content instead.