AdSense and robots.txt (Part 3)

Previously, in AdSense and robots (part 2), I described how the standard web exclusion rules (the “disallow” syntax) made it easy to accidentally block the AdSense crawler (the “Mediabot”) from scanning pages that you'd normally want read, all because you were trying to keep the search engine crawlers from indexing certain pages. Luckily, the bright bulbs at Google came up with a solution: inclusion rules.

Google's Extensions To robots.txt

All of Google's crawlers support the Allow directive in addition to the standard Disallow directive. For example, the following syntax forbids access to all parts of a site except for the “/blog/” subdirectory:

User-agent: *
Disallow: /
Allow: /blog/
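To see why that example lets Google's crawlers into “/blog/”, here's a minimal Python sketch of the precedence Google documents for its own crawlers, as I understand it: the longest matching rule wins, and Allow wins a tie. The function name is_allowed and the list-of-pairs rule format are made up for illustration, and the sketch handles plain prefixes only. A parser that simply applies the first matching rule in file order would stop at “Disallow: /” and block everything.

```python
def is_allowed(path, rules):
    """Decide whether `path` may be crawled.

    `rules` is a list of (directive, prefix) pairs taken from ONE
    User-agent group. Hypothetical helper, plain prefix matching only.
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty or non-matching rules are skipped
        is_allow = directive.lower() == "allow"
        # Longest prefix wins; on equal length, Allow beats Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("Disallow", "/"), ("Allow", "/blog/")]
print(is_allowed("/blog/my-post.html", rules))  # True
print(is_allowed("/private/page.html", rules))  # False
```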

The primary use for inclusion rules, however, is to give different crawlers different levels of access. In the last posting, for example, I showed you a rule that keeps the duplicate content on a WordPress blog out of the search engines:

User-agent: *
Disallow: /2005/
Disallow: /2006/
Disallow: /2007/
Disallow: /link.php
Disallow: /category/
Disallow: /page/
Disallow: /feed/

The problem with this rule is that it blocks all crawlers. We'd still like the Mediabot to get access to those pages so Google can show the right ads on all pages, duplicate content or not. With the Allow syntax the fix is trivial: just add a new rule targeting the Mediabot exclusively:

User-agent: *
Disallow: /2005/
Disallow: /2006/
Disallow: /2007/
Disallow: /link.php
Disallow: /category/
Disallow: /page/
Disallow: /feed/

User-agent: Mediapartners-Google
Allow: /2005/
Allow: /2006/
Allow: /2007/
Allow: /link.php
Allow: /category/
Allow: /page/
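If you want a quick sanity check of a two-group file like this, Python's standard urllib.robotparser module can parse it from a string. This is only a rough sketch: the example.com URL is a placeholder, and Python's parser is not Google's, so treat the result as a hint rather than proof of what the Mediabot will do. It does show the key idea, though: a crawler that has its own User-agent group follows that group instead of the “*” group.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /2005/
Disallow: /2006/
Disallow: /2007/
Disallow: /link.php
Disallow: /category/
Disallow: /page/
Disallow: /feed/

User-agent: Mediapartners-Google
Allow: /2005/
Allow: /2006/
Allow: /2007/
Allow: /link.php
Allow: /category/
Allow: /page/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URL; substitute a real page from your own site.
url = "http://example.com/2005/some-post/"

print(parser.can_fetch("Googlebot", url))             # False: falls under "*"
print(parser.can_fetch("Mediapartners-Google", url))  # True: has its own group
```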

AdWords users should take note of this syntax, too, because landing pages are often very similar to each other and are frequently excluded from the search engines for that reason. If you want, you can use the “allow” syntax to give the AdsBot-Google crawler (the AdWords crawler) permission to access the landing pages. Note, however, that by default the AdWords crawler ignores exclusion rules that apply to all crawlers, i.e. “User-agent: *” (here's the reference). So in most cases your landing pages are safe. But it doesn't hurt to explicitly ensure that the AdWords crawler can see them.
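If you manage a lot of landing pages, you could generate the explicit group instead of typing it by hand. Here's a small Python sketch; the allow_group function and the “/landing/” path are hypothetical, so substitute whatever folders your landing pages actually live in.

```python
def allow_group(user_agent, paths):
    """Return a robots.txt group that explicitly allows `paths` for one crawler."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Allow: {path}" for path in paths]
    return "\n".join(lines) + "\n"


# "/landing/" is a made-up folder used only for illustration.
print(allow_group("AdsBot-Google", ["/landing/"]))
```

Paste the output below your existing rules, separated from them by a blank line.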

Aside: Crawlers voluntarily choose to respect robots.txt. At times your system will be crawled by a crawler that ignores the rules you've defined, or (like the AdWords crawler) doesn't respect all of them. Some sites are aggressively crawled by scrapers and other crawlers with ignoble purposes and end up using other facilities for limiting access to their content by non-humans.

Google also supports extensions to the pattern matching used to determine which rules apply to which folders or files on a site. The matching used in a standard robots.txt file is pretty simple: any URL that begins with a given pattern is matched. This is why it's important to include the trailing “/” character at the end of folders. Google's matching is more sophisticated: a “*” in a pattern matches any sequence of characters and a “$” at the end of a pattern matches the end of the URL, which is useful at times for more complicated scenarios.
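Here's a rough Python sketch of the difference between the two kinds of matching. The function names are my own, and google_match is only an approximation based on my reading of Google's documentation (“*” matches any run of characters, a trailing “$” anchors the end of the URL), not Google's actual code.

```python
import re


def prefix_match(pattern, path):
    """Standard matching: the path only has to start with the pattern."""
    return path.startswith(pattern)


def google_match(pattern, path):
    """Approximate Google-style matching with '*' and a trailing '$'."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each "*".
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None


# Without the trailing "/", the pattern catches more than the folder.
print(prefix_match("/page", "/pages-about-us.html"))    # True
print(prefix_match("/page/", "/pages-about-us.html"))   # False

# Wildcard plus end anchor: only URLs that END in .pdf.
print(google_match("/*.pdf$", "/docs/report.pdf"))      # True
print(google_match("/*.pdf$", "/docs/report.pdf?x=1"))  # False
```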

Now, it sure would be nice if there were a way to test a robots.txt file, with and without these extensions, in order to ensure that the rules actually work as we expect them to work. Luckily, there's an easy way to do the testing. We'll get to that next.


Eric Giguere is the author of Uncommon AdSense and the award-nominated (that just means it lost!) blog Make Easy Money with Google and AdSense.
