Thursday, July 5, 2007

SEO_Wordpress: Plugin to maximise search engine positioning in Wordpress

Wordpress, like so many other CMSs (content management systems), has a huge problem with duplicate content - to understand this I need you to think like a search engine spider. So, I want you to close your eyes and imagine yourself as a spider - breathe deeply, close all 8 eyes and count backwards from 10… (if you don’t have time to read the full article you can download the plugin now by clicking here or read about my recent interview with Google here).
Robot (Spider) Fundamentals

Ok folks - are you all zenned into the spider frame of mind? No? Well, to help you out, I’ll give you a couple of hints about life as a search engine robot (spider):-
I only have a limited amount of time to visit your site.
I usually (but not always) arrive via your index page.
My job is to look over the page I arrive at, save any content I see and then send it ‘back to base’, then follow any links on that page and do the whole thing again.
Wordpress is not Spider Friendly

Wordpress has a fundamental flaw - it’s designed for humans (WOW! That’s a concept!), so it tends to make life difficult for spidey by putting the same content in lots of different places. Take this post for example - this exact same post can be found in numerous spots on my site:-
It will be found in the SEO Tools ‘category’ (and any other categories I have it in)
I’ll be able to find it in the monthly archive for June, 2007.
I’ll be able to find it on the index page (http://www.utheguru.com) for at least a little while (after which it will gradually sink deeper and deeper into the bowels of the site).
I’ll be able to find it in the RSS feed for my site.
It will be available in the form of a trackback.
Last, but not least, it will be available as an individual post.

So, we have THE EXACT SAME CONTENT replicated all over the place - this problem is known, surprisingly, as duplicate content.
Telling the spider ‘where to go’

Why is this such a problem? Well, put on your spidey thinking cap again - first off, rule number one - spider has limited time. If you check your server logs, you’ll see that the spider only crawls deeply (spends more than a few seconds traversing your site) about every 7th visit. For me, that means about once a week; smaller sites might only get ‘deep crawled’ every month or so. The rest of the time, Google usually only crawls the front page, a few of your newer posts and pages that other sites have linked to - so we need to make the most of our opportunities. Matt Cutts (the head of the Google Webspam Team) talks about ‘Herding the Bots’ on his blog, which should give you an idea of just how important this is. In short, Matt describes various ways of telling the bots which pages you consider important, using tools such as robots.txt, rel=nofollow and something called the “meta noindex” tag.
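If you’d rather not eyeball your server logs, here is a minimal sketch of how you could count Googlebot requests per day and spot the ‘deep crawl’ days. It assumes your server writes the common Apache/Nginx “combined” log format; the sample lines, IP address and the helper name log_line are made up for illustration.

```python
import re
from collections import Counter

# Assumption: "combined" access log format. Adjust the pattern if yours differs.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] "(?:GET|HEAD) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits_per_day(lines):
    """Count how many requests Googlebot made on each day."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "Googlebot" in match.group("agent"):
            counts[match.group("day")] += 1
    return counts

GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
FIREFOX = "Mozilla/5.0 (Windows; U) Firefox/2.0"

def log_line(day, path, agent):
    """Build one made-up log line for the demo below."""
    return f'192.0.2.10 - - [{day}:06:12:01 +0000] "GET {path} HTTP/1.1" 200 5120 "-" "{agent}"'

SAMPLE_LOG = [
    log_line("01/Jul/2007", "/", GOOGLEBOT),
    log_line("01/Jul/2007", "/", FIREFOX),  # a human visitor, ignored
    log_line("02/Jul/2007", "/", GOOGLEBOT),
    log_line("02/Jul/2007", "/category/seo/seo-tools", GOOGLEBOT),
    log_line("02/Jul/2007", "/seo_wordpress-wordpress-seo-plugin", GOOGLEBOT),
]

for day, hits in sorted(googlebot_hits_per_day(SAMPLE_LOG).items()):
    print(day, hits)
```

Days with noticeably more hits than the rest are your likely deep crawls. Note that anyone can fake the Googlebot user agent, so treat the numbers as a rough guide.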
Staying out of the ‘supplemental index’

Why is herding the bots so important? Well, another prominent (ex) Googler, Vanessa Fox gives us a hint, I quote:-

“The question I got most often after the session was about the supplemental index. Does having duplicate content cause sites to be placed there? Nope, that’s mostly an indirect effect. If you have pages that are duplicates or very similar, then your backlinks are likely distributed among those pages, so your PageRank may be more diluted than if you had one consolidated page that all the backlinks pointed to. And lower PageRank may cause pages to be supplemental.”

A supplemental page is a page that isn’t as likely to appear when someone does a search for something you’ve written about - you can read heaps more about the supplemental Index and Bot Behaviour on my post about how to get out of the supplemental index. Duplicate content is something you should try to avoid if you want your pages to stay out of the supplemental index.
My Strategy - a combination of robots.txt and noindex

So - how do we avoid this problem of duplicate content and make our Wordpress inherently more search engine friendly in one fell swoop? Well, first of all, we start with robots.txt. A robots.txt file tells search engines which parts of your site they should and should not crawl. In the case of Wordpress, I really don’t want versions of my articles in trackbacks, RSS feeds, or archives to be indexed - so, I block them using the following robots.txt:-
# Rules for all robots
User-agent: *
Disallow: */trackback*
Disallow: /wp-*
Disallow: */feed*
Disallow: /20*

# The AdSense bot may crawl everything
User-agent: MediaPartners-Google
Allow: /
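To get a feel for what those rules actually catch, here is a rough sketch that mimics wildcard matching, where * stands for any run of characters and each rule is matched from the start of the path. Wildcards are an extension honoured by Googlebot and some other crawlers rather than by every robot, and this simplified matcher ignores things like Allow precedence, so take it as an approximation. The function names are my own invention.

```python
import re

# Rules copied from the "User-agent: *" group of the robots.txt above.
DISALLOW_RULES = ["*/trackback*", "/wp-*", "*/feed*", "/20*"]

def rule_to_regex(rule):
    """Turn a rule into a regex: '*' matches anything, the rest is literal,
    and the match is anchored at the start of the path."""
    parts = [re.escape(part) for part in rule.split("*")]
    return re.compile("^" + ".*".join(parts))

def is_blocked(path, rules=DISALLOW_RULES):
    """Return True if any Disallow rule matches the given URL path."""
    return any(rule_to_regex(rule).match(path) for rule in rules)

for path in [
    "/",
    "/seo_wordpress-wordpress-seo-plugin",
    "/seo_wordpress-wordpress-seo-plugin/trackback/",
    "/category/seo/seo-tools",
    "/2007/06/",
    "/feed/",
    "/wp-admin/",
]:
    print(path, "blocked" if is_blocked(path) else "crawlable")
```

One thing this makes obvious: the rules are greedy. A post whose address starts with /20 (say, /20-seo-tips) or contains /feed (say, /feedback) would be blocked as well, so check your own permalinks before copying the file.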

Ok - cool - so, now, when googlebot (or any other robot) crawls my site it doesn’t go near any of those locations (except for Mediapartners-Google - that’s the AdSense bot - we want it to be able to see all pages so that it can serve well-targeted ads) - so we’re immediately herding Googlebot to the remaining three sources of duplicate content:-
The copy on the index page (ie, on http://www.utheguru.com).
Our main copy (http://www.utheguru.com/seo_wordpress-wordpress-seo-plugin).
The copy (or copies) in the category pages (ie the copy at http://www.utheguru.com/category/seo/seo-tools).

Of these three, we really only want the first two - so, we could potentially robots.txt out the category pages - but that would be a bad idea. Why?
Wordpress posts tend to ‘age’ quickly

Posts fairly quickly disappear off the main page as time goes by, but they remain in the category pages longer. If we were to robots.txt out all category pages, we’d run a fairly high risk of having older posts disappear from the index altogether - googlebot would no longer be able to easily find them and would assume they’d been lost forever. The solution? The meta noindex tag - if we add the following tag in the <head> section of our category pages, we tell googlebot that we want it to follow all the links on those pages but not actually put the category pages themselves in the index - in essence, herding the bot to our content pages.
<meta name="ROBOTS" content="noindex,follow">
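The decision itself is tiny. As an illustration only (this is not the plugin’s code, and the page type labels are hypothetical), the logic boils down to something like this:

```python
# Hypothetical page type labels: only category pages get the tag.
NOINDEX_FOLLOW_TYPES = {"category"}

def robots_meta_tag(page_type):
    """Return the robots meta tag for a page type, or '' if none is needed."""
    if page_type in NOINDEX_FOLLOW_TYPES:
        # Follow the links, but keep this page itself out of the index.
        return '<meta name="ROBOTS" content="noindex,follow">'
    return ""  # index page and individual posts stay indexable

for page_type in ["index", "post", "category"]:
    print(page_type, repr(robots_meta_tag(page_type)))
```

Leaving the tag off entirely is enough for the pages we want indexed, since indexing and following links is the default behaviour.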
The SEO_Wordpress plugin

Ok - so if I’ve done my job right, you should be totally confused by now. DO NOT DESPAIR - the good news is that I’ve written a plugin (based upon one called DupPrevent) that does this all for you. Using and installing the plugin is simple - just download it by clicking here, drop it in your wp-content/plugins/ folder and then activate it using the ‘plugin’ tab in your Wordpress admin panel.

NOTE: I realise that some people like to make their own changes to robots.txt. If that’s the case for you, it’s fine. If you have a custom robots.txt, the plugin detects that and will skip the robots.txt changes so you can make them yourself. If you’re unsure about robots.txt syntax, or anything else I’ve discussed on this page, the Google Webmaster Help Team has put together the following great FAQ.

Extra Note: I had a couple other questions from readers:-

What if I already have a robots.txt?

In the situation which you describe, the plugin will detect there is an existing robots.txt, and will let it be. Any existing robots.txt takes precedence over the plugin by default.
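For the curious, a ‘leave an existing file alone’ check can be as simple as the sketch below. It is purely an illustration of the idea and not the plugin’s actual code: it assumes robots.txt is a plain file in the site’s root folder, and the function name is made up.

```python
from pathlib import Path
import tempfile

# Rules copied from the robots.txt shown earlier in this post.
DEFAULT_ROBOTS_TXT = """User-agent: *
Disallow: */trackback*
Disallow: /wp-*
Disallow: */feed*
Disallow: /20*
User-Agent: MediaPartners-Google
Allow: /
"""

def write_default_robots_txt(site_root):
    """Write the default rules unless a robots.txt already exists.

    Returns True if the file was written, False if an existing one was kept.
    """
    robots_path = Path(site_root) / "robots.txt"
    if robots_path.exists():
        return False  # a custom file takes precedence
    robots_path.write_text(DEFAULT_ROBOTS_TXT)
    return True

# Demo in a throwaway folder so nothing real gets touched.
with tempfile.TemporaryDirectory() as site_root:
    print(write_default_robots_txt(site_root))  # first call: no file yet
    print(write_default_robots_txt(site_root))  # second call: file is kept
```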

What does the /20* mean in the robots.txt you describe?

The /20* means block any page whose address starts with your domain name (and a forward slash), immediately followed by a string starting with the digits 20. Have a look at one of your archive pages - the stock standard archive page goes something like this - http://www.utheguru.com/2007/06/ - so this blocks all archives between the years 2000 and 2099. I think that should be sufficient.

What is the ‘head’ section you describe?

The “head” section is an invisible part of your page that gives browsers (and search engines) information about the page - if you are using Firefox, you can hit Ctrl+U (it might work on the other browsers too) to show the source code of your page. You’ll see the meta robots code is inserted in there.
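If you want to check a page without reading the source by hand, a short script can pull the robots tag out of the head section for you. This is a minimal sketch using Python’s built-in HTML parser; the class and function names are my own, and the sample page is made up.

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Collect the content of every <meta name="robots"> tag inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            # The name is compared case-insensitively (ROBOTS or robots).
            if (attributes.get("name") or "").lower() == "robots":
                self.directives.append(attributes.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

def robots_directives(html):
    """Return the robots meta directives found in the page's head section."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.directives

SAMPLE_PAGE = """<html><head>
<title>SEO Tools</title>
<meta name="ROBOTS" content="noindex,follow">
</head><body><p>Category listing goes here.</p></body></html>"""

print(robots_directives(SAMPLE_PAGE))
```

An empty list means the page carries no robots tag, which is what you want to see on your index page and individual posts.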

On that topic - if you like playing with Wordpress, I’d suggest you get a Firefox plugin called ‘Web Developer Firefox Plugin’ - it’s a great way to easily play with your CSS files and has heaps of other tools - but you will need to get the latest version of Firefox for it to work.
Based on http://www.utheguru.com/seo_wordpress-wordpress-seo-plugin
