Friday, September 12th, 2008

Controlling Duplicate Content On A WordPress Blog

"Out of the box" WordPress makes some really basic mistakes when it comes to duplicate content. It's not necessarily the fault of WordPress, so much as the people who design themes for WordPress.

If you're not familiar with the concept of duplicate content, the problem is this: when Googlebot (or another search engine spider) finds the same piece of text in multiple locations, it doesn't know which location is authoritative. It'll do its best to pick one, but it might not be the one you want it to pick. After all, "they're just dumb machines". OK, not so dumb these days, but they don't necessarily understand things that humans see as obvious.

Take this post as an example. It will show up on the main page of the blog for a while, and it has its own post page, but it will also show up on the daily, monthly, and yearly archive pages, plus the page for each category or tag that I associate with it. Depending on how I tag the post, that can be a lot of places, which is just confusing to the spiders that crawl the site.

When I first set up this blog I took the wrong approach. I was only worried about the home page, so I put something along these lines in my VirtualHosts file (similar to an .htaccess file)...

RewriteCond %{HTTP_USER_AGENT} (googlebot|slurp|msnbot|teoma) [NC]
RewriteRule ^/blog/page/ /blog/ [NC,R=301,L]

What that did was redirect the spiders from the second and subsequent pages back to the home page of the blog. That solved part of the problem, but not nearly enough of it. It also created a different problem: the paged listings could no longer pass "link juice", because I'd stopped the spiders from seeing them.

My new approach is much more comprehensive. If you put the following code in the header.php file for your theme between the <head> and </head> tags you'll do far better than I was doing originally...

<?php
// Day, month, year, search, and author listings, plus page 2+ of any listing:
// let spiders crawl them and follow their links, but keep them out of the index.
if (is_day() || is_month() || is_year() || is_search() || is_author()
    || stripos($_SERVER['REQUEST_URI'], '/page/') !== false  // pretty permalinks: /page/2/
    || !empty($_GET['paged'])) {                              // default permalinks: ?paged=2
?>
<meta name="robots" content="noindex,follow" />
<?php } ?>

Let me explain what that does...

For starters, it lets the search engine spiders see all the pages of your site - it doesn't hide anything. That means every page on your site passes its "link juice" appropriately.

However, by using "noindex" I've made it so certain pages don't get put in the search engine indexes. Those pages are:

  • Day archive pages
  • Month archive pages
  • Year archive pages
  • Search result pages
  • Author archive pages
  • Any page beyond the first page of anything
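
The list above can also be written as one small function. This is only a sketch under a few assumptions: the function name dupcontent_robots_meta is made up for illustration, the code would live in the theme's functions.php, the theme's header.php calls wp_head(), and WordPress's is_paged() conditional tag stands in for the URL checks as the test for "beyond the first page". The decision is kept separate from the WordPress calls so it can be tried outside WordPress.

```php
<?php
// Hypothetical helper: returns the robots meta tag for pages that should stay
// out of the index, or an empty string for pages that should be indexed.
// $flags holds the results of WordPress's conditional tags.
function dupcontent_robots_meta(array $flags): string
{
    $noindex_keys = array('day', 'month', 'year', 'search', 'author', 'paged');
    foreach ($noindex_keys as $key) {
        if (!empty($flags[$key])) {
            return '<meta name="robots" content="noindex,follow" />' . "\n";
        }
    }
    return '';
}

// Inside WordPress, print the tag while the <head> section is being built.
// The guard lets this file load outside WordPress without errors.
if (function_exists('add_action')) {
    add_action('wp_head', function () {
        echo dupcontent_robots_meta(array(
            'day'    => is_day(),
            'month'  => is_month(),
            'year'   => is_year(),
            'search' => is_search(),
            'author' => is_author(),
            'paged'  => is_paged(), // true on page 2 and beyond of any listing
        ));
    });
}
```

Category and tag pages are deliberately missing from the list of keys, so the first page of each still gets indexed.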

You want to exclude archive pages because they lack a clear theme or focus, and communicating a clear theme or focus for each page to the search engine is one of the fundamental goals of SEO.

You may think that spiders don't execute forms and hence wouldn't hit a search page, but 1) someone may have linked to a search results page on your site, and 2) spiders are starting to execute forms.

Indexing pages past the first page of a category or tag is a bad idea for several reasons. For starters, the content shifts from page to page. A post starts on the first page, but gradually moves to the second page, then the third page, and so on... Because the content isn't stable, these aren't good pages for the search engines to index. You do want the general concept of the category or tag to get indexed - you just don't want every page in that category or tag to get indexed.

The stable pages are the post pages and that's what you want to guide the spiders to use.
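
If you want to check that the tag is showing up where you expect, something like the following might help. It's a rough sketch, not part of WordPress: the function name page_is_noindexed is made up, and it assumes PHP's DOM extension is available. It takes a page's HTML and reports whether it carries a robots meta tag containing "noindex".

```php
<?php
// Hypothetical helper: true if the HTML contains a robots meta tag whose
// content includes "noindex", false otherwise.
function page_is_noindexed(string $html): bool
{
    if ($html === '') {
        return false; // nothing to parse
    }

    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return false;
    }

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $is_robots = strtolower($meta->getAttribute('name')) === 'robots';
        if ($is_robots && stripos($meta->getAttribute('content'), 'noindex') !== false) {
            return true;
        }
    }
    return false;
}
```

Feed it the source of a post page and of the second page of a category: the post page should come back false and the category page true.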

Categories: Duplicate Content
