
Archive for April, 2006

How gzip encoding reduces bandwidth

April 24th, 2006

Yesterday, Matt Cutts posted more details about the caching that Google's crawlers are now doing, to further clarify the whole AdSense push vs. AdSense pull issue. One of the things he mentioned was how webmasters can turn on “gzip encoding” to save even more bandwidth. Since not everyone reading this is a webmaster, I thought I'd explain what he meant in more detail.

HTTP Headers

As you know, the HTTP protocol is what a web browser uses to communicate with a web server. The browser (a type of web client or user agent) always initiates the conversation with the web server by sending it a URL. In other words, if you type http://www.memwg.com/blog/adsense into your browser to read this blog, the browser sends a request (technically, a “GET” request) to the server located at www.memwg.com for the content located at the path /blog/adsense.

However, a bunch of other information gets sent along with the request: the type of browser being used, the user's preferred languages, the underlying operating system type, what kind of image formats are accepted, etc. (See Masquerading Your Browser for information on how to alter or hide some of this information.) This information is attached to the request as a set of headers, basically name-value pairs of data. You can use my free HTTP header viewer tool to see what headers your browser is sending right now.
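To make the "name-value pairs" idea concrete, here is a minimal Python sketch that builds the text of a GET request by hand and then reads the headers back out of it. The function names and the header values (including the "ExampleBrowser/1.0" user agent) are made up for illustration; a real browser sends more headers than this.

```python
# Sketch only: shows that an HTTP request is a request line followed by
# "Name: value" header lines. Header values below are illustrative.

def build_request(host, path, headers):
    """Return the raw text of a simple HTTP/1.1 GET request."""
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # A blank line marks the end of the headers.
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_headers(request_text):
    """Return the headers of a raw request as a dict of name -> value."""
    head = request_text.split("\r\n\r\n", 1)[0]
    parsed = {}
    for line in head.split("\r\n")[1:]:  # skip the "GET ..." request line
        name, _, value = line.partition(":")
        parsed[name.strip()] = value.strip()
    return parsed


example_headers = {
    "User-Agent": "ExampleBrowser/1.0",  # hypothetical browser name
    "Accept-Language": "en-us,en;q=0.5",
    "Accept": "text/html,image/png,image/jpeg",
}
request_text = build_request("www.memwg.com", "/blog/adsense", example_headers)
print(request_text)
print(parse_headers(request_text))
```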

Content Encoding

Normally, any data requested by the client is sent by the web server byte-for-byte down the pipe. If you request a web page that is 10,320 bytes long, the web server sends the entire 10,320 bytes to the client. In other words, the data is sent in its “raw” or “natural” form.

One of the headers that a client can send is the Accept-Encoding header, which tells the web server that the client can receive compressed data as an alternative. If the server so chooses, it selects one of the encodings that the client supports (the client sends a list of supported encodings) and compresses the data with the selected encoding algorithm. Instead of sending the 10,320-byte document from the example above, it might end up sending a 4,567-byte document, which is a significant saving. (The amount of compression depends on the algorithm being used and the data being compressed. Typically, HTML pages become much smaller.)

When the server encodes data like this, it's the client's job to decode it on the other end back into its raw form. The server actually sends headers back to the client as part of the response, and one of those, the Content-Encoding header, indicates which algorithm it used for the encoding. The client can then decode the data by selecting the appropriate algorithm.
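The selection step described above can be sketched in a few lines of Python. This is a simplified illustration, not how any particular web server does it: the function name choose_encoding and the server's list of supported encodings are my own assumptions, and the sketch only skips encodings the client has marked with q=0 rather than fully ranking them by preference.

```python
# Sketch only: pick a compression encoding from an Accept-Encoding value.
# Simplifications: no ranking by q-value and no "*" wildcard handling.

def choose_encoding(accept_encoding, server_supported=("gzip", "deflate")):
    """Return the first server-supported encoding the client accepts, or None."""
    offered = []
    for item in accept_encoding.split(","):
        token, _, params = item.strip().partition(";")
        token = token.strip().lower()
        if not token:
            continue
        quality = 1.0
        params = params.strip().lower()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat an unreadable q-value as "not accepted"
        if quality > 0:
            offered.append(token)
    for encoding in server_supported:
        if encoding in offered:
            return encoding
    return None  # send the data uncompressed


print(choose_encoding("gzip, deflate"))
print(choose_encoding("identity"))
```

When the function returns None, the server would simply send the raw bytes and leave out the Content-Encoding header.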

GZIP Encoding

On Unix/Linux machines, the gzip application is used to compress and decompress data. But the term “gzip” or “GZIP” is also used as shorthand for the compression/decompression algorithm used by the gzip application. (Strictly speaking, the algorithm is called DEFLATE, and gzip is the file format wrapped around it.) So when you hear someone refer to “gzip encoding”, they're talking about data that is encoded by the same algorithm used by the gzip application.

A web browser that understands gzip encoding sends an Accept-Encoding header that looks like this:

Accept-Encoding: gzip

The web server encodes the data using the gzip algorithm and sends back the appropriate Content-Encoding header:

Content-Encoding: gzip

The browser then uses the gzip decoding algorithm to return the data to its normal, uncompressed form.
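Here is a small sketch of that round trip using Python's standard gzip module. The sample page is invented and deliberately repetitive, so the sizes it prints are not representative of real pages; the point is only that the decoded bytes match the original exactly.

```python
import gzip

# An invented, repetitive HTML page standing in for a real document.
html = (
    "<html><body>"
    + "<p>Hello, AdSense readers!</p>" * 200
    + "</body></html>"
).encode("utf-8")

# What the server would do before sending "Content-Encoding: gzip".
compressed = gzip.compress(html)

# What the browser would do after receiving the response.
restored = gzip.decompress(compressed)

print("original bytes:  ", len(html))
print("compressed bytes:", len(compressed))
print("round trip matches:", restored == html)
```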

Why GZIP Encoding Helps

The idea behind gzip encoding is to reduce the amount of data being transferred over the network. In the example above, the size of the document was reduced by more than half, from 10,320 bytes to 4,567 bytes. Not only does the data transmit more quickly, you also get charged less for its transmittal. In general, the less bandwidth you're using, the less you pay.
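The arithmetic behind "over half" is easy to check. The sketch below uses the two byte counts from the example; the function name savings_percent is just something I made up for illustration.

```python
def savings_percent(original_bytes, compressed_bytes):
    """Return the percentage of bytes saved by compression."""
    return (1 - compressed_bytes / original_bytes) * 100


saved = savings_percent(10320, 4567)
print(f"{saved:.1f}% fewer bytes sent")
```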

There are downsides to gzip encoding, though. Any data compression takes time and processing cycles, so a heavily-used web server may find itself slowed down even more if gzip encoding is enabled. Not all data types compress well, either. Image formats such as JPEG, GIF and PNG are already compressed, so they gain little and can even end up slightly bigger when compressed again. For that reason the server shouldn't automatically compress everything, even if the client requests it. And some older clients have bugs in their decoding algorithms.
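You can see the "not everything compresses well" effect with a quick experiment. In this sketch I use seeded pseudo-random bytes as a stand-in for an already-compressed image, which is an assumption on my part: real image files will behave similarly but not identically. The repetitive HTML is invented as well.

```python
import gzip
import random

# Repetitive, text-like data: the kind of thing gzip handles well.
html_like = ("<tr><td>AdSense</td><td>earnings</td></tr>\n" * 250).encode("utf-8")

# Stand-in for already-compressed data such as a JPEG (assumption):
# pseudo-random bytes have no patterns left for gzip to exploit.
rng = random.Random(0)
image_like = bytes(rng.getrandbits(8) for _ in range(len(html_like)))

for label, data in (("html-like", html_like), ("image-like", image_like)):
    packed = gzip.compress(data)
    print(f"{label}: {len(data)} bytes -> {len(packed)} bytes")
```

The text shrinks dramatically, while the random bytes come out slightly larger because gzip still has to add its own header and trailer.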

Note that gzip encoding is not limited to web browsers; it can be used by web crawlers as well. Browsers and crawlers look the same to a web server. They just send different headers. Matt indicated that Google has now enabled gzip encoding in all of its crawlers. So if you're finding that your site is being crawled excessively and the crawlers are using up your precious bandwidth, make sure gzip encoding is enabled in your web server. It could make a big difference.
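If you want to check whether your own server responds with gzip, one rough way is to send a request that advertises gzip support, as a crawler would, and look at the Content-Encoding header that comes back. The sketch below uses Python's standard urllib module; the function names are hypothetical, and the example.com address in the note that follows is a placeholder for your own site.

```python
import urllib.request


def build_probe(url):
    """Build a GET request that advertises gzip support."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def reported_encoding(url, timeout=10):
    """Fetch url and return its Content-Encoding response header, or None.

    Needs network access. A result of "gzip" suggests compression is on
    for that page; None suggests the page was sent uncompressed.
    """
    with urllib.request.urlopen(build_probe(url), timeout=timeout) as response:
        return response.headers.get("Content-Encoding")
```

Calling reported_encoding("http://www.example.com/") with your own address in place of the placeholder shows what the server sent for that one page. Keep in mind that a server may compress some content types and not others.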

Eric Giguere is the contextual advertising expert who wrote Make Easy Money with Google and Uncommon AdSense. You can read this blog by mail if it's more convenient for you. Just send a blank email to memwg-blog@aweber.com to subscribe.

Why can't Google have normal text referrals?

April 21st, 2006

I'm on the road for a few days without a computer, but thanks to my BlackBerry and AvantGo I can still keep up with what's new in the AdSense world. Just don't expect long postings or a lot of links, because these small devices are a challenge to use for blogging.

There are a lot of smart people at Google, but sometimes they just don't get it. When I and other publishers ask for text referral links, what we want is the ability to do something like this:

http://adsense.google.com/?ref=xxxxxxxx

In other words, act like every other referral program on the planet! I don't know why they feel compelled to control the links so closely. Let me place them in plain text emails. Let me encourage people to sign up for AdSense. It's not like there's an immediate payout or anything.

That's all I'll say for now. We're off to the Ford factory tour in Dearborn…

Matt Cutts confirms AdSense publishers not crawled more frequently

April 20th, 2006

In response to a question I left him about the AdSense crawler (see yesterday's posting), Matt Cutts left me the following response:

Eric, I talked about mediabot more today and even made a couple PowerPoint slides. I may post about this more when I get back from WMW, but: pages with AdSense will not be indexed more frequently. It’s literally just a crawl cache, so if e.g. our news crawl fetched a page and then Googlebot wanted the same page, we’d retrieve the page from the crawl cache. But there’s no boost at all in rankings if you’re in AdSense or Google News. You don’t get any more pages crawled either.

In other words, it's just the “AdSense pull” model being used, which is what I thought. This lets them make better use of their bandwidth. Now there's still an open question as to whether the different crawlers Google uses are all storing the same information, as some people have reported different results showing up in the Google index depending on which crawler fetched the page. Time for a new question to Matt, I guess!
