Tuesday, April 17th, 2007

Be Careful: Robots.txt Is Case Sensitive

One of our clients produces and distributes video and audio-based stories for a variety of major companies and organizations. Each story can potentially be distributed via two different types of web sites: one or more media-facing sites (with or without client branding) and/or a public-facing site. To avoid duplicate content issues, they block access to the media-facing sites with a robots.txt file that looks like this:

  User-agent: *
  Disallow: /home.aspx?Story=
  Disallow: /clienthome.aspx?Story=
  Disallow: /playcontent.aspx

So I thought they were pretty well protected until I saw the following in a SERP this morning:

[Image: Example of robots.txt capitalization problem]

Suddenly I realized robots.txt is case-sensitive (notice the first URL: it's the lowercase equivalent of something which is blocked in robots.txt). Needless to say, the client has been alerted and a new robots.txt will be up shortly.
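To see why the lowercase URL got through, here is a minimal Python sketch of the plain prefix matching that the robots.txt convention describes. It is not any search engine's actual code; the function name `is_blocked` and the story ID `123` are made up for illustration.

```python
# Disallow rules copied from the client's robots.txt above.
DISALLOW_RULES = [
    "/home.aspx?Story=",
    "/clienthome.aspx?Story=",
    "/playcontent.aspx",
]


def is_blocked(url_path, rules=DISALLOW_RULES):
    """Return True if url_path starts with any Disallow prefix.

    str.startswith compares characters exactly, so the match is
    case-sensitive, just like the crawler behavior described above.
    """
    return any(url_path.startswith(rule) for rule in rules)


# The story ID 123 is a hypothetical value for illustration.
print(is_blocked("/home.aspx?Story=123"))  # True: matches the rule exactly
print(is_blocked("/home.aspx?story=123"))  # False: lowercase "story" slips through
```

One changed letter is enough to make the rule miss, which is exactly what showed up in the SERP.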

A Better Solution

But I have to say this is a bit lame. Sure, I know the W3C has said that URLs can be case-sensitive, and the search engines do tell people that robots.txt is case-sensitive (see Vanessa Fox's post on the sitemaps blog). But the fact of the matter is that Google and others know that certain operating systems are case-insensitive, so you'd think they'd put two and two together. If a server identifies itself (which it typically does in the Server header of every response) as running an operating system they know is case-insensitive, then they should treat the robots.txt file as being case-insensitive. If they're not sure, they could always request robots.txt and RoBoTs.txt and see if they get the same file.
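The two-request probe could look something like the Python sketch below. It is only an illustration of the idea, not how any crawler works. The `fetch` callable, the function name `looks_case_insensitive`, and the two fake servers are all assumptions, used so the logic can be tried without touching the network.

```python
from urllib.parse import urljoin


def looks_case_insensitive(fetch, base_url):
    """Guess whether a server ignores case in URL paths.

    fetch is any callable that takes a URL and returns a
    (status_code, body) tuple.
    """
    status_a, body_a = fetch(urljoin(base_url, "/robots.txt"))
    status_b, body_b = fetch(urljoin(base_url, "/RoBoTs.txt"))
    # The same successful response for both spellings suggests the
    # server does not distinguish case. It is a hint, not proof.
    return status_a == 200 and status_b == 200 and body_a == body_b


ROBOTS_BODY = "User-agent: *\nDisallow: /playcontent.aspx\n"


def fake_case_insensitive_server(url):
    # Hypothetical server that answers any capitalization.
    return (200, ROBOTS_BODY)


def fake_case_sensitive_server(url):
    # Hypothetical server that only knows the exact lowercase name.
    if url.endswith("/robots.txt"):
        return (200, ROBOTS_BODY)
    return (404, "")


print(looks_case_insensitive(fake_case_insensitive_server, "http://example.com"))  # True
print(looks_case_insensitive(fake_case_sensitive_server, "http://example.com"))    # False
```

In real use, `fetch` would wrap an HTTP client. A server that returns a custom 200 error page for every unknown URL would still be caught, since the bodies would differ.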

IMHO, a robots.txt file should be handled pretty conservatively: case insensitivity should be assumed except perhaps when the OS is known to be case-sensitive. I've written elsewhere about how Googlebot won't crawl a site when it gets a timeout requesting robots.txt. This is really good, conservative behavior. But in the case cited above it seems like they're splitting hairs to get pages into their index which shouldn't be there. They know the web server OS is case-insensitive, so why are they treating the robots.txt file as being case-sensitive?

Yes, a case-insensitive server could proxy a case-sensitive server or run a case-sensitive web application, but the point is that, when in doubt, robot exclusions should always be handled in a way in which the exclusion is assumed. It's about the likely intention of the webmaster, and here that means erring on the side of not crawling the page.
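The "assume the exclusion" rule I'm arguing for can be sketched in a few lines of Python. This is my own illustration, not anything a search engine has published; `is_blocked_conservative` and the `known_case_sensitive` flag are hypothetical names.

```python
def is_blocked_conservative(url_path, rules, known_case_sensitive=False):
    """Prefix-match url_path against Disallow rules, erring toward blocking.

    Unless the caller is sure the server is case-sensitive, both sides
    are lowercased before comparing, so any capitalization of a blocked
    URL is treated as blocked too.
    """
    # An empty Disallow value means "allow everything", so skip it.
    rules = [rule for rule in rules if rule]
    if not known_case_sensitive:
        url_path = url_path.lower()
        rules = [rule.lower() for rule in rules]
    return any(url_path.startswith(rule) for rule in rules)


rules = ["/home.aspx?Story=", "/playcontent.aspx"]

# Default: in doubt, so the oddly capitalized URL counts as blocked.
print(is_blocked_conservative("/HOME.aspx?story=123", rules))  # True

# Only when case sensitivity is known does exact matching apply.
print(is_blocked_conservative("/HOME.aspx?story=123", rules,
                              known_case_sensitive=True))  # False
```

The cost of this default is that a crawler might skip a page the webmaster actually wanted indexed on a case-sensitive server, which seems like the safer mistake to make.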

This Can Be Exploited by People Who Don't Like You...

You might be wondering why all of this is important. Let's say you put up a site and have some content you don't want indexed, so you set up a robots.txt file and think you're fine. Along comes a competitor who wants to cause you problems, so they figure out the URLs you're trying to exclude and they put links to versions of those pages with different capitalizations. So for example, your URL:

They link to as:

And they can keep changing the capitalizations on you - every time you catch one and exclude it they can link to another URL with slightly different capitalization.
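To get a feel for how many capitalizations there are to chase, here is a small Python sketch that enumerates them. The function name `case_variants` and the short path `/a.aspx` are made up for illustration, and the sketch assumes plain ASCII paths.

```python
from itertools import product


def case_variants(path):
    """Yield every upper/lower-case spelling of path (ASCII assumed)."""
    # For each character, offer both cases if it is a letter,
    # otherwise just the character itself.
    choices = [
        (ch.lower(), ch.upper()) if ch.isalpha() else (ch,)
        for ch in path
    ]
    for combo in product(*choices):
        yield "".join(combo)


variants = list(case_variants("/a.aspx"))
print(len(variants))  # 32, i.e. 2 ** 5 because the path has 5 letters
```

The count doubles with every letter. A path like /playcontent.aspx has 15 letters, so it has 2 ** 15 = 32,768 spellings, far too many to list one by one in robots.txt.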

Yes, you can set up your server to combat this problem, but it's beyond the expertise of most webmasters.
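One way a server could combat it is to redirect any odd capitalization back to the one canonical spelling that robots.txt blocks. The Python sketch below shows only the lookup logic, under assumptions: the `CANONICAL_PATHS` list and the `redirect_target` function are hypothetical, and query strings are left out entirely.

```python
# Hypothetical list of the canonical (correctly capitalized) paths on the site.
CANONICAL_PATHS = ["/home.aspx", "/clienthome.aspx", "/playcontent.aspx"]

# Lookup table from lowercased path to its canonical spelling.
_BY_LOWER = {path.lower(): path for path in CANONICAL_PATHS}


def redirect_target(requested_path):
    """Return the canonical path to 301-redirect to, or None.

    None means the request is either already canonical or unknown,
    so the server should handle it as usual.
    """
    canonical = _BY_LOWER.get(requested_path.lower())
    if canonical is None or canonical == requested_path:
        return None
    return canonical


print(redirect_target("/PlayContent.aspx"))  # /playcontent.aspx
print(redirect_target("/playcontent.aspx"))  # None
```

A crawler that follows a hostile link to /PlayContent.aspx would then be sent to /playcontent.aspx, which the robots.txt rule does cover. Handling the capitalization of query string parameters would need extra work.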

"Use 2 Condoms Just to Be Safe"...

If you're worried about this, the other way to keep your pages out of the search engine indexes is to use a robots meta tag. It will keep them from indexing the page even if the exclusions in your robots.txt file fail. The following will keep them from indexing the page and from following any links they may find on the page (to other private pages).

  <meta name="robots" content="noindex,nofollow" />
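If you want to double-check that a private page really carries the tag, something like the following Python sketch would do it using only the standard library. The class name `RobotsMetaFinder`, the function `has_noindex`, and the sample page are my own inventions for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def has_noindex(html):
    """Return True if the page asks robots not to index it."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


# Hypothetical page used only to exercise the check.
page = '<html><head><meta name="robots" content="noindex,nofollow" /></head></html>'
print(has_noindex(page))  # True
```

Note the meta tag only works if the crawler is allowed to fetch the page, which is the case for any URL that slipped past robots.txt.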

I'll also be suggesting that approach for my client's site...

[And for the record - in the bedroom, using two condoms is riskier than using just one...]

Categories: Spiders/Bots