AdSense and robots.txt (Part 1)

My last posting was about my attempts at improving the rankings of the Guide to Electronic Fence and Pet Containment, previously known as the Invisible Fence Guide. One of the things I need to do is get the search engines to crawl only one copy of the content, since the same content is repeated in different locations on the site. This is done using a special file called a “robots.txt” file, but if you're not careful you'll end up blocking other crawlers (like the AdSense crawler) that need to access all those pages, regardless of whether they're duplicate content or not. Here's how to avoid that.

Why robots.txt?

In the early days of the Web, search engines would be quite indiscriminate about what they'd index from a site. Often pages that were actually meant to be private — known just to a small group of people — would make their way into a search engine's results. For this and other reasons, a Web robots exclusion standard was developed to allow website owners to tell well-behaved robots what they couldn't index. (You can find detailed information about the standard at the robotstxt.org page.)

The basic idea is quite simple. At the root of a site — the top-level folder — the webmaster places a simple text file (not an HTML file, just a file created with a text editor like Windows Notepad) called “robots.txt” that gives a set of rules for determining which parts of the site are to be ignored by crawlers.

So if I wanted to exclude robots from certain parts of www.memwg.com, I'd create a robots.txt file (case is important, so don't capitalize any part of the name!) at www.memwg.com/robots.txt. The first time a crawler comes to the site, the first thing it does is fetch the robots.txt file using the URL http://www.memwg.com/robots.txt. If there is no such file, the crawler assumes that the whole site can be indexed. Otherwise, it applies the rules it finds in the robots.txt file to the list of pages it generates for the site and only indexes the ones that make it through those rules. In other words, the robots.txt file is a filter that removes URLs that shouldn't be indexed.
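
To make the lookup concrete, here's a minimal Python sketch of how a crawler might work out where a site's robots.txt file lives. The function name robots_url_for and the sample page URL are my own inventions for illustration, and real crawlers will differ in the details.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, swap in /robots.txt as the path,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# A made-up page deep inside the site still maps to the root-level file.
print(robots_url_for("http://www.memwg.com/some/folder/page.html"))
```

No matter how deep the page is, the answer is always the file at the root of the site, which is why the file has to live there.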

I should point out that there is another way to keep crawlers from indexing the contents of a web page: a <meta> tag in the <head> section of the HTML file. For example:

<html>
<head>
<title>A page about nothing</title>
<meta name="robots" content="noindex,nofollow">
</head>
<body>
<p>Nothing to see, move on folks, move on.</p>
</body>
</html>

See the HTML Author's Guide to the Robots META Tag for all the details. The problem with this approach, however, is that it only protects HTML files, and it requires modifying every file you want to protect. The robots.txt file is a better and simpler approach to the problem.
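
For the curious, here's roughly what honoring a robots <meta> tag looks like from the crawler's side, sketched in Python with the standard library's html.parser module. The class name, function name and sample page are hypothetical, and a real crawler would handle many more cases.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html_text):
    """Return the set of robots directives declared in a page."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.directives


# Hypothetical page used only for this illustration.
PAGE = """<html><head><title>A page about nothing</title>
<meta name="robots" content="noindex,nofollow">
</head><body><p>Nothing to see.</p></body></html>"""

directives = robots_directives(PAGE)
print("noindex" in directives, "nofollow" in directives)
```

Notice that the crawler has to download and parse the page before it even knows it wasn't supposed to index it.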

The Classic robots.txt File

The classic robots.txt file — the one defined by the Web robots exclusion standard mentioned above — is very simple to create. The file consists of one or more exclusion rules.

An exclusion rule has two parts to it. The first part is the User-agent line that is used to identify which crawler the rule applies to. The second part is a sequence of one or more Disallow lines that identify the parts of the site that the crawler is to ignore. Here's a simple example:

User-agent: *
Disallow: /

This example tells all crawlers (“*” is a wildcard character that matches anything) to ignore all files in the “/” folder (the root folder). In other words, the rule above prevents all crawlers from crawling the entire site.

That's a bit of an extreme example. A more likely rule looks like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
Disallow: /links.html

This protects a number of directories and a single file from crawling.
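
If you'd like to check a rule like this before uploading it, here's a small sketch using Python's standard urllib.robotparser module. The helper name allowed_urls, the crawler name ExampleBot and the page URLs are all made up for the example, and Python's parser is only one interpretation of the standard, so treat the result as a sanity check and not a guarantee of how every crawler behaves.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
Disallow: /links.html
"""


def allowed_urls(robots_text, user_agent, urls):
    """Return only the URLs that the rules let user_agent fetch."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_text.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical pages, used only to exercise the rules.
CANDIDATES = [
    "http://www.memwg.com/index.html",
    "http://www.memwg.com/links.html",
    "http://www.memwg.com/private/notes.html",
    "http://www.memwg.com/articles/robots.html",
]

for url in allowed_urls(RULES, "ExampleBot/1.0", CANDIDATES):
    print(url)
```

Only the index page and the article page should make it through the filter; the links page and anything under the private folder are dropped.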

User Agents

Some of you might be wondering what a “user agent” is. I've already written a detailed description of user agents in my article Masquerading Your Browser, which I suggest you read, but put simply, a user agent is a string (a sequence of characters) that the web browser sends to the web server whenever it requests a file. The web server can use this string to identify what kind of browser is making a request and return different versions of a page. For example, in How to Detect Internet Explorer I show one way the user agent string can be used to detect that Microsoft Internet Explorer is asking for a page.
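
Here's a tiny Python sketch of where that string travels: it's just a header attached to the request. The helper name build_request and the client name ExampleBrowser/1.0 are invented for illustration, and the code only builds the request without sending anything.

```python
from urllib.request import Request


def build_request(url, user_agent):
    """Build (but don't send) a request that announces user_agent."""
    return Request(url, headers={"User-Agent": user_agent})


request = build_request("http://www.memwg.com/", "ExampleBrowser/1.0")
# urllib stores header names in capitalized form, hence "User-agent".
print(request.get_header("User-agent"))
```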

Crawlers fetch files from a web server the same way that browsers do. Well-behaved crawlers use the user agent string to identify themselves to the web server. Google's search engine crawler, also known as the “Googlebot”, uses a string that looks like this:

Googlebot/2.1 (+http://www.googlebot.com/bot.html)

So if you want to exclude the Googlebot from a specific folder, you'd add a rule like this to your robots.txt file:

User-agent: Googlebot
Disallow: /notforgoogletosee/

This rule blocks any crawler whose user agent contains the phrase “Googlebot” from accessing the “/notforgoogletosee/” folder on the website.
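
Here's a quick way to see that matching in action, again sketched with Python's urllib.robotparser. The helper may_fetch, the crawler name ExampleBot and the page names are assumptions of mine, and Python's matching logic isn't necessarily identical to what any particular crawler does.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Googlebot
Disallow: /notforgoogletosee/
"""

GOOGLEBOT = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
OTHER_BOT = "ExampleBot/1.0"  # hypothetical crawler, not a real one


def may_fetch(user_agent, path):
    """Report whether RULES let user_agent fetch the given path."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(user_agent, path)


# Made-up page names inside and outside the protected folder.
print(may_fetch(GOOGLEBOT, "/notforgoogletosee/page.html"))
print(may_fetch(OTHER_BOT, "/notforgoogletosee/page.html"))
print(may_fetch(GOOGLEBOT, "/index.html"))
```

The rule only bites when both the user agent and the folder match: the Googlebot is turned away from the protected folder, while the other crawler and the rest of the site are unaffected.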

Geeks collect and disseminate information about user agents. For example, you can get a list of user agents from UserAgentString.com. The AdSense crawler identifies itself as:

Mediapartners-Google/2.1

Remember that string, because it's important. We'll see why next time.


Eric Giguere is the author of Uncommon AdSense and the award-nominated (that just means it lost!) blog Make Easy Money with Google and AdSense.
