Public Resource republishes many court documents. Although these documents are all part of the public record, and PR will not take them down simply because someone finds their publication uncomfortable, PR will evaluate and honor some requests to remove documents from search engine results. Public Resource does so using a robots.txt file, part of the "robot exclusion protocol" that websites use to, among other things, tell search engines’ web-crawling "robots" which pages they do not want indexed and included in search results. Originally, these files were mostly used to keep robots from abusing server resources by walking through infinite lists of automatically generated pages, or to block search engines from indexing user-contributed content that might include spam.
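If you have never seen one, a robots.txt file is just a plain-text list of rules served from a site’s root. A minimal, hypothetical example (the paths here are made up for illustration, not Public Resource’s actual file):

```
# Hypothetical robots.txt (illustrative paths only)
User-agent: *
Disallow: /cgi-bin/           # keep robots out of generated pages
Disallow: /cases/12345.html   # a document someone asked to have de-listed
```

A compliant crawler fetches this file before anything else and skips every path matching a Disallow rule.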
The result for Public Resource, however, is that PR is now publishing, in the form of its robots.txt, a list of all of the cases that people have successfully requested to be made less visible!
In Public Resource’s case, this is the result of a careful decision; PR makes the arrangement clear on their website. The robots.txt home page also explains the situation, saying, "the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don’t want robots to use," and "don’t try to use /robots.txt to hide information."
That said, I’ve looked at a bunch of robots.txt files on websites I have visited recently and, sadly, I’ve found many sites that use robots.txt as a form of weak security. This is very dangerous.
Some poorly designed robots simply ignore the robots.txt file. But one can also imagine an evil search engine that uses a web-crawler that does the opposite of what it’s told and only indexes these "hidden" pages. This evil crawler might look for particular keywords or use existing search engine data to check for incoming links in order to construct a list of pages whose existence is only made public through a file meant to keep people away.
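To make the threat concrete, here is a minimal sketch in Python of that evil crawler’s first step: it reads a site’s robots.txt and treats every Disallow rule as a target list. The example.com URL is a placeholder, and this is an illustration of the idea, not any real search engine’s code.

```python
# Sketch of a hostile crawler's first step: read a site's robots.txt and
# treat every Disallow rule as a list of interesting targets.
# example.com is a placeholder domain.
from urllib.parse import urljoin
from urllib.request import urlopen

def disallowed_paths(site):
    """Return the URLs a site's robots.txt asks crawlers to avoid."""
    with urlopen(urljoin(site, "/robots.txt")) as response:
        lines = response.read().decode("utf-8", errors="replace").splitlines()
    targets = []
    for line in lines:
        line = line.split("#", 1)[0].strip()      # drop trailing comments
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path and path != "/":              # a bare "/" just blocks everything
                targets.append(urljoin(site, path))
    return targets

# A well-behaved robot avoids these URLs; an "evil" one crawls them first.
for url in disallowed_paths("https://example.com/"):
    print(url)
```

Everything this prints is a URL whose owner explicitly asked for it not to be indexed.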
Check your own robots.txt and ask yourself what it might reveal. By advertising the existence and locations of your secrets, the act of "hiding" might make your data even less private.
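One quick way to do that audit, as a sketch (yoursite.example is a placeholder for your own domain):

```python
# Self-audit: print every rule your own robots.txt currently advertises.
# Replace yoursite.example with your actual domain.
from urllib.request import urlopen

with urlopen("https://yoursite.example/robots.txt") as response:
    for line in response.read().decode("utf-8", errors="replace").splitlines():
        if line.strip().lower().startswith("disallow:"):
            print(line.strip())   # each rule points at something you are "hiding"
```

If any of those lines would surprise or embarrass you, robots.txt is the wrong tool for the job.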
this is fascinating! thanks for the pointer.
Mako, this is already a solved problem. See
http://danielwebb.us/software/bot-trap/
Thanks for the pointer, Joe. That’s great! Unfortunately, the folks who are careful enough to find and install bot-trap are probably not the people misusing their robots.txt file as a security system.
I would argue that, while the solution of course achieves absolutely no security, it still reduces exposure in most cases.
I don’t think it can be said that it reduces security, even though at first sight it does look a bit like an advertisement for your hidden content.
If I had some content on PR which I didn’t want to show up in web searches, there is a chance I would still be quite satisfied with this solution.
– If my file is not listed in robots.txt, it comes up in web searches on my name or other sensitive keywords; anyone looking for information on this topic will easily find it.
– If my file is listed in robots.txt, it can still be found by people who are intently looking for my dirty secrets. Their lives are perhaps made slightly more difficult, in an insignificant way, if they lack a technical background, but certainly no easier.
– Only people trawling for any dirty secret at all are really helped by the robots.txt, and those people are not specifically interested in my dirty laundry; their attention is diluted.
You can also search for robots.txt files themselves; try inurl:/robots.txt with Google. If you have an idea which path names you’re looking for, this comes in handy.