Robots Exclusion Standard

Posted: December 14, 2011 in SEO

When primitive robots were first created, some of them would crash servers. The robots exclusion standard was crafted so you can tell any robot (or all of them) that you do not want some of your pages indexed or that you do not want your links followed. You can do this with a meta tag in the head of the page:
<meta name="robots" content="noindex,nofollow">
or you can create a robots.txt file, which tells robots where NOT to go. The official robots exclusion protocol document is located here:
http://www.robotstxt.org/wc/exclusion.html
The robots.txt file goes in the root level of your domain, and the file name must be robots.txt.
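
To give a feel for how a well-behaved robot might read the meta tag, here is a minimal Python sketch built on the standard library's html.parser module. The names RobotsMetaParser and robots_directives are made up for this illustration, and real crawlers handle many more edge cases.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma separated, e.g. "noindex,nofollow".
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
print(sorted(robots_directives(page)))  # ['nofollow', 'noindex']
```

A robot that honors the tag would skip indexing the page when "noindex" is in the set and skip its links when "nofollow" is in the set.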

This allows all robots to index everything:
User-agent: *
Disallow:

This blocks all robots from your entire site:
User-agent: *
Disallow: /

You can also disallow a folder or a single file in the robots.txt file. This disallows a folder:
User-agent: *
Disallow: /projects/

This disallows a file:
User-agent: *
Disallow: /cheese/please.html
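
If you want to check rules like the ones above before uploading them, Python's standard urllib.robotparser module can evaluate them locally. This is only a sketch: the helper name allowed and the example.com URLs are invented for the illustration, and the module follows the original standard rather than any one search engine's exact behavior.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, url, agent="*"):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


allow_all = "User-agent: *\nDisallow:\n"
block_all = "User-agent: *\nDisallow: /\n"
block_folder = "User-agent: *\nDisallow: /projects/\n"
block_file = "User-agent: *\nDisallow: /cheese/please.html\n"

print(allowed(allow_all, "http://example.com/anything.html"))        # True
print(allowed(block_all, "http://example.com/anything.html"))        # False
print(allowed(block_folder, "http://example.com/projects/a.html"))   # False
print(allowed(block_folder, "http://example.com/about.html"))        # True
print(allowed(block_file, "http://example.com/cheese/please.html"))  # False
print(allowed(block_file, "http://example.com/cheese/other.html"))   # True
```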

One problem many dynamic sites have is sending search engines multiple URLs with nearly identical content. If you have products in different sizes and colors, or with other small differences, you are likely to generate lots of duplicate content, which can keep search engines from wanting to fully index your site. If you place your variables at the start of your URLs, you can easily block all of the sorting options using only a few disallow lines.

For example:
User-agent: *
Disallow: /cart.php?size
Disallow: /cart.php?color
This would block search engines from indexing any URL that starts with /cart.php?size or /cart.php?color. Notice that there is no trailing slash at the end of the above disallow lines. That means the engines will not index anything whose URL starts with that string. If there were a trailing slash, the rule would only block that specific folder. If the sort options were at the end of the URL, you would need to either create an exceptionally long robots.txt file or place the robots noindex meta tag inside the sort pages.

You can also name a specific user agent, such as Googlebot, instead of using the asterisk wildcard. Many bad bots will ignore your robots.txt file and/or harvest the blocked information, so do not rely on robots.txt to keep people from finding confidential information.
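
The "starts with" behavior and the effect of a trailing slash are easy to see in a few lines of Python. This is a simplified sketch of plain prefix matching, not a full robots.txt parser; the function name is_blocked and the sample URLs are assumptions made for the example.

```python
def is_blocked(path_and_query, disallow_rules):
    """Return True if the URL path (plus query string) starts with any rule.

    An empty rule blocks nothing, mirroring an empty "Disallow:" line.
    """
    return any(path_and_query.startswith(rule) for rule in disallow_rules if rule)


cart_rules = ["/cart.php?size", "/cart.php?color"]

print(is_blocked("/cart.php?size=large", cart_rules))       # True
print(is_blocked("/cart.php?color=red&id=7", cart_rules))   # True
print(is_blocked("/cart.php?id=7&size=large", cart_rules))  # False, size is not at the start
print(is_blocked("/cart.php", cart_rules))                  # False

# With a trailing slash only that folder is blocked; without it,
# anything that merely starts with the same characters is blocked too.
print(is_blocked("/projects-old/a.html", ["/projects/"]))   # False
print(is_blocked("/projects-old/a.html", ["/projects"]))    # True
```

The third call shows why sort options at the end of the URL are harder to block with plain prefix rules.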
Googlebot also supports wildcards in robots.txt. For example, the following:
User-agent: Googlebot
Disallow: /*sort=
would stop Googlebot from reading any URL that included the string “sort=” no matter where that string occurs in the URL.
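
One way to picture the wildcard rule is to translate it into a regular expression where each asterisk stands for "any run of characters". The sketch below does only that; the function name wildcard_blocked is hypothetical, and it is an approximation for illustration, not Googlebot's actual matching code.

```python
import re


def wildcard_blocked(path_and_query, rule):
    """Return True if the URL matches a disallow rule containing * wildcards.

    The rule is still anchored at the start of the URL; each * matches
    any sequence of characters, including none.
    """
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.match(pattern, path_and_query) is not None


rule = "/*sort="

print(wildcard_blocked("/list.php?sort=price", rule))         # True
print(wildcard_blocked("/shop/items?page=2&sort=asc", rule))  # True
print(wildcard_blocked("/list.php?page=2", rule))             # False
print(wildcard_blocked("/resort/", rule))                     # False
```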
