Thursday, September 18th, 2008

Getting Started with SEO

When people start to learn the web they hear pretty quickly that they should "optimize" their pages. But a surprising number of people, even some big bloggers, don't do it... I'm not sure of the reasons, but I have a feeling some people find the process intimidating, while others just feel "Google should figure it out". Either way, people who don't do even basic optimization of their sites will consistently perform poorly compared to their competitors... If you're going to spend the time to build a website, take a little more time to give it a bigger impact.

The things I discuss here are all pretty "safe". Some people who do SEO recommend tricks that can result in penalties. What I mention here is all 100% safe...

So what is SEO? SEO stands for Search Engine Optimization. The goal of SEO is to get more traffic from search engines like Google and Yahoo!

If you only understand one thing about SEO, remember this... Your page is being evaluated by a piece of software called a "spider" or "bot". Since it's not being evaluated by a human being, the spider doesn't "get" all the intuitive conclusions a human would draw about your page. Spiders are far more literal in how they interpret what they see. So the absolute most important thing about SEO is to be literal and clear. If you ever have an SEO question you don't know the answer to, ask yourself what would be most literal and clear to a spider and you'll probably have the right answer...

It's also important to think like a spider and understand that spiders don't trust you very easily because there are so many "web spammers" out there who are trying to trick them. They'll do things like putting white text on a white background to have text a human won't see, but a bot will see. Or they'll have an image with text in it and have an alt tag that has different text. The tricks spammers play are numerous, but just remember "literal and clear" and you'll be fine...

Rule One: Only visible text counts

When you talk to people about SEO, if they start talking to you about "meta keywords" just ignore everything else they say. The rule of thumb is that only visible text counts to spiders. Too many spammers were packing too much useless text into parts of the page no human would see, so hidden text gets virtually no recognition from the search engine spiders.
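To make "only visible text counts" concrete, here is a minimal Python sketch of how a very simple text extractor might skip everything in the <head> (including meta keywords) plus scripts and styles. The class and function names are my own, and this is only an illustration: it doesn't detect tricks like white text on a white background, and real spiders are far more sophisticated.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text from the page body, ignoring <head>, <script> and <style>."""

    SKIPPED = {"head", "script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a skipped element
        self.chunks = []  # pieces of visible text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: the meta keywords never reach the extracted text.
page = """<html><head>
<title>Brown cows</title>
<meta name="keywords" content="cheap, free, cow, moon, jump" />
</head><body><p>The brown cow jumped over the moon.</p></body></html>"""

print(visible_text(page))  # The brown cow jumped over the moon.
```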

Rule Two: Nothing is more important than the <title> tag

Take a look at a SERP (Search Engine Result Page) in Google...

The elements that are most clear and visible in the SERPs are the titles of each page that's listed. That text comes directly from the <title> tag in the HTML on your page. So pay close attention to what you put there and think about how it will look to the user when they see it in a SERP.

It's surprising how many sites get <title> tags wrong. Take a look at how YouTube does not take advantage of title tags...

You can tell by the URLs what the pages are about, but not by the titles. Obviously if YouTube had title tags like "YouTube: CSun50 Video Group", "YouTube: Syria Video Group", and "YouTube: ChineseValentine08 Video Group" they'd do much better at getting people to click on their links. Luckily for them, Google has written special rules to integrate YouTube's video content into the SERPs. Your site won't have any special rules, so you need to pay attention to your <title> tags!
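If you want to check your own pages, a rough Python sketch like the following pulls out the <title> text and flags titles that say nothing beyond the site name. The function names and the "generic title" test are my own assumptions for illustration, not anything a search engine publishes.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Captures the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def page_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def looks_generic(title, site_name):
    """True when the title is empty or contains nothing but the site name."""
    leftover = title.lower().replace(site_name.lower(), "").strip(" :-|")
    return leftover == ""


print(looks_generic("YouTube", "YouTube"))                     # True
print(looks_generic("YouTube: Syria Video Group", "YouTube"))  # False
```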

Rule Three: Write in a clear, direct manner

The text on the page is incredibly important. Spiders are inherently textual, but they're also sorta dumb: they're looking for the same words that people are searching for. So if you wrote

The auburn bovine achieved the incredible feat of circumnavigating the rock in orbit around the 3rd planet from the sun.

It wouldn't be nearly as effective as

The brown cow jumped over the moon.

Why? Because brown is searched on more than auburn, cow more than bovine, jumped more than 'incredible feat of circumnavigating', and moon more than 'the rock in orbit around the 3rd planet from the sun'.

You also have to ask yourself which version has the keywords that best describe the concept. Cow, jump, and moon are, by far, the best and most likely keywords someone might type in if they were looking for something on that topic...

Let's take another example from a book entitled "Panic Encyclopedia" (p. 132)...

Ralph Lauren for Polo Dungarees uses the rhetoric and imagery of the "natural" in the context of the overtly simulated to appropriate not only the style of New Left politics but also that of the eu-jean company Levi Strauss.

It's hard enough for me to figure out what that sentence is saying. A spider would probably be even more baffled, though it would pick up keywords like "Ralph Lauren", "New Left politics" and "Levi Strauss". Still, the writing style is esoteric and unclear.

Rule Four: Link text is incredibly important

Imagine you're a spider and you're trying to understand the following links... They all point to the same thing, but which do you think does the best job?

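  1. Video of Ralph Nader talking to a parrot (the whole phrase is the link)
  2. Ralph Nader talks to a parrot (see video), where only "see video" is the link
  3. An image link: parrot-nader.jpg, with the alt text "Ralph Nader talks to a parrot"
  4. An image link: rnpar.jpg, with no alt text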

Can you tell a difference between the 4 links? Which is best? This is where you need to balance SEO with other factors... You could ask which is most likely to get the person to click and see the video. But in SEO the goal is to get organic search engine traffic to the page you're linking to... Let's go through the examples...

  1. The link is on text that describes quickly and concisely what the destination page is about. From an SEO perspective this is the best, but it may not be the link that is best at getting people to click and view the video.
  2. The linked text "see video" does not describe what the target page is about, however it's adjacent to text that has good keywords. The only thing this one has going for it is a "call to action" in that it directly tells the viewer to do something ("see video").
  3. You may not be able to tell the difference between 3 and 4, but there is a big difference. #3 has what's called "alt text", which describes the image both in the form of title="Ralph Nader talks to a parrot" and alt="Ralph Nader talks to a parrot". In addition the image is named parrot-nader.jpg, so you have keywords in the file name as well. If you find you want to use an image to link to another page - that is how it should be done. It's still not as good as text, because alt text is frequently abused by spammers and spiders don't trust it as much as real text on the page.
  4. The last link is, by far, the worst. There is no link text, no alt text, and the file name is rnpar.jpg which doesn't tell the spider anything either.

So, from an SEO perspective, keyword rich textual links are by far the best way to go. If you do find yourself needing to use images, make sure you have alt text to go along with the image so there's a textual equivalent for the spider to understand what you're doing. And yes, spiders can find links even in JavaScript nav bars and inside Flash, but none of those links will ever be quite as good as the simple text link.
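To see the four links the way a very literal program would, here is a small Python sketch that records the link text, alt text and image file name for each link. The markup is my reconstruction of the four examples from the descriptions above, video.htm is a hypothetical destination, and the parser is only an illustration of what textual clues exist, not how any search engine actually scores them.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Records the textual clues attached to each <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # one dict per link, in document order
        self.current = None  # the link currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.current = {"text": "", "alt": "", "image": ""}
        elif tag == "img" and self.current is not None:
            self.current["alt"] = attrs.get("alt") or ""
            self.current["image"] = attrs.get("src") or ""

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            self.current["text"] = self.current["text"].strip()
            self.links.append(self.current)
            self.current = None


def describe_links(html):
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Reconstructed markup for the four examples; video.htm is hypothetical.
sample = """
<a href="video.htm">Video of Ralph Nader talking to a parrot</a>
Ralph Nader talks to a parrot (<a href="video.htm">see video</a>)
<a href="video.htm"><img src="parrot-nader.jpg" alt="Ralph Nader talks to a parrot" title="Ralph Nader talks to a parrot"></a>
<a href="video.htm"><img src="rnpar.jpg"></a>
"""

for number, link in enumerate(describe_links(sample), start=1):
    print(number, link)
```

Link 1 comes back with the full descriptive phrase as its text, link 2 with only "see video", link 3 with alt text and a keyword-rich file name, and link 4 with nothing but rnpar.jpg.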

And don't just think about links to and from other sites. The internal link structure of your site is important as well...

Rule Five: Every page should have one primary concept or theme

Spiders think in terms of pages and their goal is to understand the theme of every page they encounter. The more you have a clear focus to your pages, the better. Even pages that don't seem to have a clear focus can have one. Take the home page of the NY Times as an example. It would seem to be a page with many topics, but notice what's in their title tag - "Breaking News, World News & Multimedia". The multimedia part is a bit weak, but "breaking news" exactly describes the theme of that page. "Top news stories" might also have been a good choice...

A bad example would be all of the Guide To The Tube posts that Towleroad (a prominent blogger) puts out - he has nearly 7,000 of them indexed by Google. They may be indexed by Google, but because he'll put 10 or 12 themes on the same page, they wouldn't perform nearly as well as 10 or 12 separate posts.

Rule Six: Avoid duplicate content

Duplicate content is when the same exact content appears on more than one page. This can happen for many reasons. Take the example of this blog... This post will appear on its own page as well as on category and tag pages, in addition to being on the home page of the blog for a period of time. Or take an online store that lets you change things like the sort order or the number of products displayed on a page. There could be many, many different pages that are all basically the same.
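If you want a feel for how much duplication your own site has, a rough Python sketch like this groups URLs whose text is identical once whitespace and letter case are ignored. The URLs and page text are made up, and exact matching is a simplifying assumption: real duplicates usually sit inside different templates, so they're "basically the same" rather than byte-for-byte identical.

```python
import hashlib
from collections import defaultdict


def find_duplicates(pages):
    """Group URLs whose body text is the same after normalizing whitespace and case.

    `pages` maps a URL to the text found at that URL.
    """
    groups = defaultdict(list)
    for url, body in pages.items():
        normalized = " ".join(body.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    # Only groups with more than one URL are duplicates.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical blog: the same post is reachable at two URLs.
pages = {
    "/post/seo-basics": "Getting started with SEO. Only visible text counts.",
    "/category/seo/seo-basics": "Getting started  with SEO.\nOnly visible text counts.",
    "/about": "About this blog.",
}

print(find_duplicates(pages))
```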

Now imagine you're a spider trying to figure out which is the best, most authoritative URL for a given chunk of text. While they get it right sometimes, more often than not they get it wrong. The best thing to do is to make it clear which version of the page they should index. There are a few ways you can do this...

Duplicate content can be tricky for a beginner to control. This blog has quite a few posts just on the issue of duplicate content. I'd recommend starting with the post that runs down the different types of duplicate content and what can be done about them.

What's important for the beginner is to be aware of the problem and seek help to resolve issues when you encounter them.

The most common tools to fix duplicate content issues are:

Meta Robots

A meta robots tag is a tag you put in the <head> section of your document that can tell the spider whether or not the page can be indexed. It typically looks something like this:

<meta name="robots" content="noindex,follow" />

That tells the spider it can evaluate the page and follow the links on it, but it should not put the URL in its index. The advantage of using meta robots is that the page is able to pass on 'link juice' to other pages without being a problem itself. The downside is that the spider may crawl many similar pages and put a fair amount of load on your server.

My recent post on controlling duplicate content on a WordPress blog relies exclusively on meta robots.
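As an illustration of how literally that tag is read, here is a minimal Python sketch that turns a page's meta robots tag into an index/follow decision. The names are my own, and it only handles the noindex, nofollow and none directives, with "allowed" as the default when nothing is specified.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_policy(html):
    parser = MetaRobotsParser()
    parser.feed(html)
    parser.close()
    found = parser.directives
    # With no directive present, indexing and following are both allowed.
    return {
        "index": "noindex" not in found and "none" not in found,
        "follow": "nofollow" not in found and "none" not in found,
    }


print(robots_policy('<head><meta name="robots" content="noindex,follow" /></head>'))
# {'index': False, 'follow': True}
```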


Robots.txt

Robots.txt is a file that you put in the root directory of your site. It has statements in it that completely block spiders from crawling certain pages on your site. If you want to see a moderately complex robots.txt I suggest looking at the robots.txt file for [...]. Robots.txt can also be used to bar spiders from crawling semi-private sections of your site, so it has many uses.

Robots.txt is a pretty simple file conceptually. You specify a "user agent" (how browsers and spiders identify themselves on the web) and what that user agent is not allowed to see...

To block all spiders from visiting the site the robots.txt file would look like this...

User-agent: *
Disallow: /

The asterisk means everyone, so you're disallowing everyone from seeing anything that starts with '/' - which is everything on the site.

If you wanted to stop just GoogleBot from crawling your 'private' directory, it would look like this...

User-agent: googlebot
Disallow: /private/
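If you'd like to test rules like these before putting them on your site, Python's standard library includes a robots.txt parser. This is just a quick sketch using the two examples above; the page paths and the "SomeOtherBot" name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse robots.txt rules from a string instead of fetching them from a site."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


block_everyone = build_parser("User-agent: *\nDisallow: /\n")
block_private = build_parser("User-agent: googlebot\nDisallow: /private/\n")

print(block_everyone.can_fetch("Googlebot", "/index.htm"))        # False
print(block_private.can_fetch("Googlebot", "/private/a.htm"))     # False
print(block_private.can_fetch("Googlebot", "/public/a.htm"))      # True
print(block_private.can_fetch("SomeOtherBot", "/private/a.htm"))  # True
```

Notice that the second file only restricts GoogleBot. Any other spider is still free to crawl the 'private' directory.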

To find out how to use robots.txt on your site check out [...], which also explains in depth the usage of the meta robots tag.

Apache Mod_Rewrite

By far the most advanced way to control duplicate content is to use Apache's Mod_Rewrite module. This gives you complete control over what happens and when it happens. This is not something beginners should try to use, so just know that it's possible and if you need it, find someone to help you.

Here's a pretty simple example in which two URLs return the same page: one ending in detail.htm?variantID=2983 and a friendlier one ending in 2983.htm.

The reason is that the site is based on templates - instead of having thousands and thousands of pages there can be one template for each type of page. In this case the template is detail.htm and it takes the query string variantID=2983, but I wanted a friendlier looking URL, so I make it look like 2983.htm is a real document that exists on the server. What I've done is implement a mod_rewrite rule that changes all the URLs I don't want into URLs I do want. In this case it looks like this:

# The target path /image/%1.htm is an assumption; adjust it to your own URL scheme.
RewriteCond %{QUERY_STRING} variantID=([0-9]+) [NC]
RewriteRule ^/image/detail\.htm$ /image/%1.htm? [NC,R=301,L]

The first line checks to see if variantID= (followed by a number) exists in the query string. If it does, the second line takes that number, makes it the file name followed by .htm, and sends the visitor there with a 301 (permanent) redirect.
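If mod_rewrite syntax is hard to read, the same logic can be sketched in a few lines of Python. This is not what Apache runs, just a way to see the two steps side by side. The function name is mine, and I'm assuming the friendly URL lives in the same /image/ directory as the template.

```python
import re
from urllib.parse import urlsplit


def friendly_redirect(url):
    """Return the redirect target for a template-style URL, or None if it is left alone."""
    parts = urlsplit(url)
    # Like the RewriteRule pattern: only /image/detail.htm qualifies (NC = ignore case).
    if not re.fullmatch(r"/image/detail\.htm", parts.path, re.IGNORECASE):
        return None
    # Like the RewriteCond: capture the digits after variantID= in the query string.
    match = re.search(r"variantID=([0-9]+)", parts.query, re.IGNORECASE)
    if match is None:
        return None
    # Assumed target: the captured number becomes the file name, followed by .htm.
    return f"/image/{match.group(1)}.htm"


print(friendly_redirect("/image/detail.htm?variantID=2983"))  # /image/2983.htm
print(friendly_redirect("/image/2983.htm"))                   # None
```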

Those are the big items, but the following can also help with SEO and getting organic traffic to your site...

Use <h1>, <h2>, etc.

The HTML standards say that the header tags <h1>, <h2>, <h3>, etc. exist to define the organization of the page. The spiders like it when you use them for the major themes on your page, with the most important themes getting the more prominent header tags. So use <h1> at most once on your page (say, for the title), <h2> for the major sections, and <h3> and <h4> for sub-sections.

You can still style the header tags with CSS, but header tags are much better than a styled <p> tag.
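A quick way to check a page against that advice is to pull out its header tags and count the <h1>s. Here's a minimal Python sketch; the names and the sample markup are my own, and it assumes headers aren't nested inside one another.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Builds a list of (level, text) pairs for <h1> through <h6>."""

    LEVELS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}

    def __init__(self):
        super().__init__()
        self.outline = []  # (level, text) pairs in document order
        self.level = None  # level of the header being read, if any
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag in self.LEVELS:
            self.level = self.LEVELS[tag]
            self.text = ""

    def handle_data(self, data):
        if self.level is not None:
            self.text += data

    def handle_endtag(self, tag):
        if tag in self.LEVELS and self.level is not None:
            self.outline.append((self.level, self.text.strip()))
            self.level = None


def heading_outline(html):
    parser = HeadingOutline()
    parser.feed(html)
    parser.close()
    return parser.outline


def h1_count(outline):
    return sum(1 for level, _ in outline if level == 1)


sample = "<h1>Getting Started with SEO</h1><h2>Rule One</h2><p>text</p><h2>Rule Two</h2><h3>Details</h3>"
outline = heading_outline(sample)
for level, text in outline:
    print("  " * (level - 1) + text)
print("h1 tags:", h1_count(outline))  # h1 tags: 1
```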

[Yes, this post page doesn't use header tags, but that's because WordPress doesn't facilitate using them. If you use Dreamweaver they're quite easy to use and you should definitely use them.]

Use meta description tags - sometimes...

The meta description tag is pretty popular because it is often used in the SERPs directly below the page title. Yes, this violates Rule One that only visible text is important, but the search engines evaluate the text to see if it is similar in content to what's on the page and won't use it if they think you're spamming them by putting one topic in the <title> and meta description, and another topic on the rest of the page.

A meta description tag would look something like this...

<meta name="description" content="6 safe rules to follow when you get started in SEO" />

If this page had that meta tag, chances are (though not always) that it would be what appears directly after the title "Getting Started with SEO" in the SERPs. The search engines do exercise discretion and won't always use it, but if it has something to do with the user's query they will usually use it.

You might think meta description tags are a wonderful thing and should always be used, but there's a downside to them... If the same description shows up in every SERP, the description the person sees may not be the most relevant for their question. Take the example above: if they're looking for a simple explanation of duplicate content or meta robots, they wouldn't know it was covered on the page. Without a meta description the search engine will pick the best text from the page for their particular query, and that can often be better than the text you write.

So the general rule is that if your page is highly targeted for just a few keywords, go ahead and use meta description. But if your page could get traffic on many "long-tail" keywords, then don't use meta description.
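One rough way to think about that trade-off is to ask whether the queries you hope to rank for are even mentioned in your description. The Python sketch below does a simple word check. It's my own simplification for illustration and says nothing about how the search engines actually decide which text to show.

```python
import re


def description_covers(description, query):
    """True when every word of the query appears somewhere in the description."""
    description_words = set(re.findall(r"[a-z0-9]+", description.lower()))
    query_words = re.findall(r"[a-z0-9]+", query.lower())
    return all(word in description_words for word in query_words)


description = "6 safe rules to follow when you get started in SEO"

print(description_covers(description, "SEO rules"))          # True
print(description_covers(description, "duplicate content"))  # False
```

If most of the queries you care about come back False, that's a hint the page is a "long-tail" page and may be better off without a fixed description.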

Have important content as high as possible in the HTML document

If you take this page it's made up of a header, a sidebar and a content section. You probably know that you could write the HTML for the page one of two ways...

<div id="header">header goes here</div>
<div id="sidebar">sidebar goes here</div>
<div id="content">content goes here</div>

or you could write it this way...

<div id="header">header goes here</div>
<div id="content">content goes here</div>
<div id="sidebar">sidebar goes here</div>

The spiders tend to think that important things come first, so you want to have the content section be as high as possible, so the second example is better than the first. When you think about all the HTML code in a typical sidebar, this can make a big difference.
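To put a number on "as high as possible", this small Python sketch measures how far into the HTML the content section starts for each of the two layouts. The sidebar here is padded with 50 made-up links to stand in for a typical sidebar, and the function name is my own.

```python
def content_offset(html, marker='id="content"'):
    """Fraction of the document that comes before the content section (0.0 = very top)."""
    position = html.find(marker)
    if position == -1:
        raise ValueError("marker not found in document")
    return position / len(html)


header = '<div id="header">header goes here</div>\n'
# Hypothetical sidebar: 50 links standing in for typical sidebar markup.
sidebar = '<div id="sidebar">' + '<a href="#">link</a>' * 50 + '</div>\n'
content = '<div id="content">content goes here</div>\n'

sidebar_first = header + sidebar + content
content_first = header + content + sidebar

print(f"sidebar first: content starts {content_offset(sidebar_first):.0%} of the way in")
print(f"content first: content starts {content_offset(content_first):.0%} of the way in")
```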

So there you have it... Some basic things to watch out for as you try to optimize your pages. Some of the issues are more advanced, so just watch for them and get someone to help you when you encounter them...

Categories: Duplicate Content, SEO/SEM, Spiders/Bots
