In AdSense and robots.txt (Part 1) I described the basic syntax for the robots.txt file. Today we look at how the robots.txt file can affect your AdSense income if you're not careful with how you declare the exclusion rules.
Mediabot: The AdSense Crawler
AdSense is an automated program. The very first time an AdSense ad is displayed on a page, Google quickly sends out a web crawler to read and analyze the page's content so that it can tailor the ads it displays to that content. The basic algorithm for how it chooses the ads is described in extensive detail in the AdSense patent application. (You can get my analysis of the patent for free by purchasing Uncommon AdSense, by the way.) The crawler is commonly known as the “Mediabot” because of the user agent string it sends, “Mediapartners-Google/2.1”.
Until the Mediabot has a chance to examine the page in question, AdSense selects ads based on other factors, such as:
- the URL of the page itself (which may contain ad-triggering keywords)
- the content of other pages (previously analyzed themselves) that link to the page in question
- the search queries that lead to the page
Again, all this information is detailed in the AdSense patent application. As long as the new page's content is in line with these other factors, the ads you'll see displayed will probably be on-topic. But the real determination of which ads best fit the page won't be made until a few seconds or a few minutes after the first ad is displayed.
And after the crawler's seen the page once, it will come back occasionally to revisit it for as long as AdSense ads are being displayed there. The visits are usually fairly frequent, but there's no way to control the scheduling.
But note this: the Mediabot respects the rules in the robots.txt file. This has important implications.
Don't Ban the Mediabot!
A common use of exclusion rules is to prevent search engine crawlers from seeing duplicate content. It's very easy to serve up duplicate content in blogs, for example, because most blogs have extremely flexible navigation paths: besides viewing individual posts, you can view them grouped by date, by category, etc. Normally you only want to present one view of the postings to Google, Yahoo, MSN, etc. For example, the article “WordPress SEO – using robots.txt to avoid content duplication” presents an exclusion rule that WordPress blog owners can use to present a single view of the blog content to the search engines. Here's a small fragment of the exclusion rule:
User-agent: *
Disallow: /2005/
Disallow: /2006/
Disallow: /2007/
Disallow: /link.php
Disallow: /category/
Disallow: /page/
Disallow: /feed/
Notice the use of the wildcard character in the user agent part of the rule. You're not just banning the search engine crawlers from those pages, you're also banning the Mediabot! So if you display AdSense ads on the blog, they won't be properly targeted on the pages you've blocked via the exclusion rule. If you're like a lot of WordPress users and you use permalinks that start with the date of the posting, you've just blocked crawling of those individual postings!
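If you'd like to see the effect for yourself, here's a minimal sketch that feeds the rule fragment above to the robots.txt parser in Python's standard library and asks whether the Mediabot may fetch a date-based permalink. The example.com URLs and the is_allowed helper are made up for illustration, and Python's parser is only an approximation of how Google's own crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# The wildcard rule fragment quoted in the article.
WILDCARD_RULES = """\
User-agent: *
Disallow: /2005/
Disallow: /2006/
Disallow: /2007/
Disallow: /link.php
Disallow: /category/
Disallow: /page/
Disallow: /feed/
"""

# User agent string sent by the AdSense crawler.
MEDIABOT = "Mediapartners-Google/2.1"


def is_allowed(rules: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: True if `rules` let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs: a date-based permalink and an ordinary page.
post_url = "http://example.com/2006/05/14/sample-post/"
about_url = "http://example.com/about/"

print(is_allowed(WILDCARD_RULES, MEDIABOT, post_url))   # False: blocked by /2006/
print(is_allowed(WILDCARD_RULES, MEDIABOT, about_url))  # True: no rule matches
```

The first check fails because the permalink starts with /2006/, which is exactly the situation described above for blogs that use date-based permalinks.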
So you have to be careful in how you define your exclusion rules. The wildcard character should be used with caution. If you can, create specific rules to block specific crawlers instead. The problem with this approach, of course, is that there are simply too many crawlers to list, so you'll need to limit yourself to the big ones like Google (“Googlebot”), Yahoo! (“Slurp”) and Microsoft (“msnbot”) and forget about the rest.
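As a rough illustration of the crawler-specific approach, the sketch below names the three big crawlers explicitly instead of using the wildcard, then checks the result with the parser in Python's standard library. The choice of paths to block, the example.com URL and the is_allowed helper are assumptions for the example, not a recommended configuration, and Python's parser may not match every detail of how each real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# One rule group shared by the three named crawlers; no wildcard group,
# so any crawler not listed here (the Mediabot included) is unrestricted.
SPECIFIC_RULES = """\
User-agent: Googlebot
User-agent: Slurp
User-agent: msnbot
Disallow: /category/
Disallow: /page/
Disallow: /feed/
"""

MEDIABOT = "Mediapartners-Google/2.1"


def is_allowed(rules: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: True if `rules` let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical duplicate-content view of the blog.
category_url = "http://example.com/category/adsense/"

for agent in ("Googlebot", "Slurp", "msnbot", MEDIABOT):
    print(agent, is_allowed(SPECIFIC_RULES, agent, category_url))
```

With these rules the three search engine crawlers are kept out of the category view while the Mediabot can still read it, at the cost of leaving every unlisted crawler unrestricted.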
At this point you're probably thinking that it's too bad the robots.txt file can't let you specify inclusion rules in addition to exclusion rules. Google thought so, too, and so they came up with an extension to the robots.txt syntax that lets you do precisely that.
But more on that and how it affects the AdSense crawler in part 3.
Eric Giguere is the author of Uncommon AdSense and the award-nominated (that just means it lost!) blog Make Easy Money with Google and AdSense.