13 Feb 2012 mako   » (Master)

Quasi-Private Resources

Public Resource republishes many court documents. Although these documents are all part of the public record, and PR will not take them down simply because someone finds their publication uncomfortable, PR will evaluate and honor some requests to remove documents from search engine results. Public Resource does so using a robots.txt file, part of the "robot exclusion protocol" that websites use to, among other things, tell search engines' web-crawling "robots" which pages they do not want indexed and included in search results. Originally, these files were mostly used to keep robots from abusing server resources by walking through infinite lists of automatically generated pages, or to block search engines from including user-contributed content that might contain spam.
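For readers who haven't seen one, a robots.txt file is just a short plain-text file served from a site's root. A minimal sketch (the paths here are illustrative, not Public Resource's actual rules):

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /search/
Disallow: /drafts/
```

Each "Disallow" line names a path prefix that well-behaved crawlers agree not to fetch or index.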

The result for Public Resource, however, is that PR is now publishing, in the form of its robots.txt, a list of all of the cases that people have successfully requested to be made less visible!

In Public Resource's case, this is the result of a careful decision; PR makes the arrangement clear on its website. The home page of the robots.txt standard also explains the situation, saying, "the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use," and warning, "don't try to use /robots.txt to hide information."

That said, I've looked at a bunch of robots.txt files on websites I have visited recently and, sadly, I've found many sites that use robots.txt as a form of weak security. This is very dangerous.

Some poorly designed robots simply ignore the robots.txt file. But one can also imagine an evil search engine that uses a web-crawler that does the opposite of what it's told and only indexes these "hidden" pages. This evil crawler might look for particular keywords or use existing search engine data to check for incoming links in order to construct a list of pages whose existence is only made public through a file meant to keep people away.
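The first step of such an evil crawler is trivial to sketch. Here is a minimal (hypothetical) example in Python that does the inversion described above: instead of obeying the Disallow rules, it collects them as a list of "interesting" paths to visit first. The robots.txt content and paths are invented for illustration:

```python
# Hypothetical sketch: invert a robots.txt file by harvesting its
# Disallow rules rather than obeying them.

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cases/sealed/
Disallow: /drafts/secret-memo.html
"""

def disallowed_paths(robots_txt):
    """Return every path prefix a robots.txt asks crawlers to avoid."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split('#', 1)[0].strip()      # drop trailing comments
        if line.lower().startswith('disallow:'):
            path = line.split(':', 1)[1].strip()
            if path:                              # an empty Disallow means "allow everything"
                paths.append(path)
    return paths

print(disallowed_paths(ROBOTS_TXT))
```

A well-behaved crawler skips these paths; the evil one would fetch them first, precisely because the site asked it not to.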

Check your own robots.txt and ask yourself what it might reveal. By advertising the existence and locations of your secrets, the act of "hiding" might make your data even less private.

Syndicated 2012-02-13 10:41:18 from Benjamin Mako Hill
