Problems with Google Sitemaps

It seemed like the whole blogosphere was abuzz with Google’s unveiling of its new Google Sitemaps service, a free inclusion service where you publish an XML file listing your site’s pages so that Google’s spider can get a better sense of what to crawl on your site.

This is good news, especially for dynamic sites that aren’t getting fully indexed. I appreciate Google once again showing its thought leadership. Not only is Google giving Webmasters a new way to relay information about their site structure to its spiders, but it’s sharing this new technology with the other search engines by releasing the protocol and code as open source.

This all sounds wonderful, but there are two quite major problems with Google’s approach.

First, it doesn’t solve the duplicate pages problem that a great many dynamic sites have. Even the Google Store suffers from this. The Google Sitemaps protocol does not provide a way for Webmasters to convey which pages are duplicates of other pages. A site that gets crawled incorrectly by Googlebot, because superfluous or non-essential parameters/flags are included in the URLs of links on its pages, will continue to get crawled incorrectly. An “Official Google Sitemaps Team Member” states that the sitemap XML file will merely augment their crawl; it won’t replace existing pages in the index:

This program is a complement to, not a replacement of, the regular crawl. The benefit of Sitemaps is twofold:

For links we already know about through our regular spidering, we plan to use the metadata you supply (e.g., lastmod date, changefreq, etc.) to improve how we crawl your site.
For the links we don’t know about, we plan to use the additional links you supply, to increase our crawl coverage.
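To make the metadata mentioned in that quote concrete, here is a minimal sketch of generating a sitemap file with `lastmod` and `changefreq` hints. The namespace URL is the one used by the current sitemaps.org protocol and is an assumption here (the schema URL at launch differed), and the example.com pages are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace of the current sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Return sitemap XML (as a str) for a list of page dicts."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["loc"]
        # lastmod and changefreq are optional hints, not commands to the crawler.
        for tag in ("lastmod", "changefreq"):
            if tag in page:
                ET.SubElement(url, tag).text = page[tag]
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical pages, for illustration only.
pages = [
    {"loc": "https://www.example.com/", "lastmod": "2005-06-01", "changefreq": "daily"},
    {"loc": "https://www.example.com/catalog?item=12", "changefreq": "weekly"},
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

Note that nothing in this format lets you say that two listed URLs are really the same page, which is the gap discussed here.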

The high-level Google engineer who goes by GoogleGuy in the online forums explains Google Sitemaps this way:

Imagine if you have pages A, B and C on your site. We find pages A and B through our normal web crawl of your links. Then you build a sitemap and list the pages B and C. Now there’s a chance (but not a promise) that we’ll crawl page C. We won’t drop page A just because you didn’t list it in your sitemap. And just because you listed a page that we didn’t know about doesn’t guarantee that we’ll crawl it. But if for some reason we didn’t see any links to C, or maybe we knew about page C but the URL was rejected for having too many parameters or some other reason, now there’s a chance that we’ll crawl that page C.

So, the way I read GoogleGuy’s explanation, if pages A and C are essentially duplicates of each other, with A containing an additional superfluous parameter in its URL (like sortby=default or lang=english), then BOTH could end up in Google’s index. Thus, Google Sitemaps won’t reduce the amount of duplication in Google’s index; in fact, I believe it will increase it.

Duplicate pages, on their own, may sound like less of a problem for Webmasters than for Google itself, which has to dedicate additional resources to maintaining all this redundant content in its index. However, duplication does have serious implications for Webmasters, because it results in PageRank dilution, where multiple versions of a page split up the “votes” (links) and PageRank score that a single version of the page would aggregate.

This brings me to the second, related problem with Google Sitemaps: It doesn’t do anything to alleviate the phenomenon of PageRank dilution. PageRank dilution results in lower PageRank, which in turn results in lower rankings. For example, consider that the above-mentioned Google Store’s product page (the “Black is Back T-Shirt”) is in Google’s index five times instead of just once. So each of those five variations earns only a fraction of the total potential PageRank score that it could have earned if all the links pointed to a single “Black is Back T-Shirt” page.
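The dilution effect can be illustrated with a toy model. The sketch below is a textbook-style power-iteration PageRank, not Google’s actual algorithm, and the ten referring blogs and the product URLs are made up. It compares ten inbound links spread over five URL variants against the same ten links pointing at one URL.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Toy power-iteration PageRank. `links` maps page -> list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Pages with no outgoing links spread their score evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

# Hypothetical referring pages.
referrers = [f"blog{i}" for i in range(10)]

# Diluted: the ten inbound links are spread over five URL variants of one page.
diluted = {ref: [f"/product?variant={i % 5}"] for i, ref in enumerate(referrers)}
# Consolidated: all ten inbound links point at a single URL.
consolidated = {ref: ["/product"] for ref in referrers}

diluted_rank = pagerank(diluted)
consolidated_rank = pagerank(consolidated)
best_variant = max(v for k, v in diluted_rank.items() if k.startswith("/product"))
single = consolidated_rank["/product"]
print(f"best variant: {best_variant:.3f}  single page: {single:.3f}")
```

In this toy graph the single consolidated page scores several times higher than any one of the five variants, which is the point: whichever variant ends up ranking has only a fraction of the votes behind it.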

Google Sitemaps needs to provide a way to convey, or to sync up with, the site’s hierarchical internal linking structure so that it’s clear which pages should get how much of a share of the PageRank flowing into the site’s home page. Since the primary holder of PageRank score is the home page (that is, after all, the page that almost everyone links to), it’s up to the site’s internal hierarchical linking structure to pass the PageRank of the home page to the rest of the site. As such, a page that is two clicks away from the home page will get a much larger share of the PageRank score passed on from the home page than a page that is five clicks away.
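A rough back-of-the-envelope version of that click-depth argument: assume a deliberately simplified site where every page links to the same number of child pages and nothing else, and where a damping factor of 0.85 (the value commonly quoted for PageRank) applies at each hop. Both the fanout of 10 and the strict tree shape are assumptions for illustration only.

```python
def share_at_depth(depth, fanout=10, damping=0.85):
    """Fraction of the home page's score reaching one page `depth` clicks away,
    assuming every page links to exactly `fanout` children and nothing else."""
    return (damping / fanout) ** depth

two_clicks = share_at_depth(2)
five_clicks = share_at_depth(5)
print(f"2 clicks: {two_clicks:.6f}  5 clicks: {five_clicks:.8f}")
```

Under these assumptions each extra click divides the share by more than ten, so the gap between a two-click page and a five-click page is three orders of magnitude. Real sites have cross-links and external links that soften this, but the direction of the effect is the same.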

Here’s how I suggest both of the above issues be rectified: by extending robots.txt with some additional directives that specify:

Which parameter in a dynamic URL is the “key field.”
Which parameter is the product ID and which is the category ID (specifically for online catalogs).
Which parameters are superfluous or don’t significantly vary the content displayed.

Armed with this information, Googlebot will be able to not only eliminate duplicate pages but also intelligently choose the most appropriate version to save in its index and then associate with that page the PageRank of ALL versions of the page. The days of session IDs killing a site’s Google visibility would be over! Google admits in its Sitemaps FAQ that session IDs are still a problem even with the advent of Google Sitemaps:

Question: URLs on my site have session IDs in them. Do I need to remove them?

Yes. Including session IDs in URLs may result in incomplete and redundant crawling of your site.
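Here is a minimal sketch of what a crawler could do with the kind of hints proposed above. No such robots.txt directives exist; the set of superfluous parameter names, the shop.example.com URLs, and the link counts are all hypothetical. The idea is to strip the declared parameters, collapse the variants onto one canonical URL, and credit that URL with the inbound links of every variant.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical hints a site owner might declare; not a real robots.txt feature.
SUPERFLUOUS_PARAMS = {"sessionid", "sortby", "lang"}

def canonicalize(url, superfluous=SUPERFLUOUS_PARAMS):
    """Drop superfluous parameters and sort the rest so duplicates collapse to one URL."""
    parts = urlsplit(url)
    kept = sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k.lower() not in superfluous
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

def consolidate(inbound_links):
    """Sum inbound link counts per canonical URL. `inbound_links` maps URL -> count."""
    totals = defaultdict(int)
    for url, count in inbound_links.items():
        totals[canonicalize(url)] += count
    return dict(totals)

# Made-up inbound link counts for four crawled URLs.
inbound = {
    "http://shop.example.com/item?product_id=42": 6,
    "http://shop.example.com/item?product_id=42&sortby=default": 3,
    "http://shop.example.com/item?lang=english&product_id=42&sessionid=abc123": 1,
    "http://shop.example.com/item?product_id=7": 2,
}
merged = consolidate(inbound)
print(merged)
```

The three variants of product 42 collapse into one entry carrying all ten links, while product 7 stays separate because `product_id` plays the role of the key field.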

Remember, getting indexed only gets you to the party; it doesn’t mean you’re going to be popular at the party. Google Sitemaps may help you get more pages indexed, but if those pages all have a PageRank score of “0,” then what was the point? It’ll be like sitting along the wall the whole time with no one asking you to dance!

GravityStream, our SEO proxy technology, deals with PageRank dilution by distilling URLs in links into their lowest common denominator and replacing them on the proxy. We’ve found that even as Googlebot gets more aggressive at spidering dynamic sites with complex URLs and starts indexing one of our clients’ sites more fully, our proxy still has a major leg up on the native site that it’s proxying. For example, our GravityStream proxy of PETsMART.com is No. 1 in Google for “best pet toys,” and yet the corresponding page on the PETsMART.com native site is nowhere in the first 10 pages of results even though it is indexed.

Until Google extends Google Sitemaps to deal with PageRank dilution, I’d expect that a GravityStream proxy will still trump a native site, even if that site is using Google Sitemaps. So I guess I’m asking Google to extend its protocol to make GravityStream unnecessary. We’ll lose our GravityStream revenues, but the Web will be a better place, so I’m OK with that. So here’s hoping that Google takes my feedback on board and reworks its protocol!