What’s wrong with Google Sitemaps

June 6th, 2005

by Stephan Spencer

Last Friday it seemed like the whole blogosphere was abuzz with the news that Google had unveiled its new Google Sitemaps service, a free inclusion service where you publish an XML file of your site’s pages to Google so its spider can get a better sense of what to crawl on your site. This is good news, especially for dynamic sites that aren’t getting fully indexed. I appreciate Google once again showing its thought leadership. Not only is Google giving webmasters a new way to relay information about their site structure to its spiders, but it’s sharing this new technology with the other search engines by releasing the protocol and code as open source.
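To give a feel for what publishing such a file involves, here is a minimal sketch in Python. The element names (urlset, url, loc, lastmod, changefreq, priority) follow the Sitemaps protocol as I understand it; the namespace string, the example.com URLs, and the `build_sitemap` helper are assumptions for illustration, so check the current protocol documentation before relying on them.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace of the initial Google Sitemaps schema. Verify it
# against the protocol documentation before using this for real.
SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"


def build_sitemap(pages):
    """Return sitemap XML (as a string) for a list of page dicts.

    Each dict needs a "loc" key; "lastmod", "changefreq", and "priority"
    are optional and are emitted only when present.
    """
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for page in pages:
        url = ET.SubElement(urlset, "url")
        for field in ("loc", "lastmod", "changefreq", "priority"):
            if field in page:
                ET.SubElement(url, field).text = str(page[field])
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages on a hypothetical store
pages = [
    {"loc": "http://www.example.com/", "changefreq": "daily", "priority": 1.0},
    {"loc": "http://www.example.com/product?id=42", "lastmod": "2005-06-01"},
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

Note that nothing in this file says which URLs are duplicates of one another, which is where the trouble starts.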

This all sounds wonderful, but there are two major problems with Google’s approach.

  • First, it doesn’t solve the duplicate pages problem that a great many dynamic sites have. Even the Google Store suffers from this (which I blogged about previously, but here’s a more recent example of a Google Store product page being duplicated multiple times in Google’s index). The Google Sitemaps protocol does not provide a way for webmasters to convey which pages are duplicates of other pages. A site that gets crawled incorrectly by Googlebot, due to superfluous or non-essential parameters/flags being included in the URLs of links on the pages, will continue to get crawled incorrectly. An “Official Google Sitemaps Team Member” states that the sitemap XML file will merely augment their crawl; it won’t replace existing pages in the index:

    This program is a complement to, not a replacement of, the regular crawl. The benefit of Sitemaps is two fold:
    – For links we already know about thro our regular spidering, we plan to use the metadata you supply (e.g., lastmod date, changefreq, etc.) to improve how we crawl your site.
    – For the links we dont know about, we plan to use the additional links you supply, to increase our crawl coverage.

    The high-level Google engineer who goes by GoogleGuy in the online forums explains Google Sitemaps in this way:

    Imagine if you have pages A, B, and C on your site. We find pages A and B through our normal web crawl of your links. Then you build a sitemap and list the pages B and C. Now there’s a chance (but not a promise) that we’ll crawl page C. We won’t drop page A just because you didn’t list it in your sitemap. And just because you listed a page that we didn’t know about doesn’t guarantee that we’ll crawl it. But if for some reason we didn’t see any links to C, or maybe we knew about page C but the url was rejected for having too many parameters or some other reason, now there’s a chance that we’ll crawl that page C.

    So, the way I read GoogleGuy’s explanation, if pages A and C are essentially duplicates of each other, with A containing an additional superfluous parameter in its URL (like sortby=default or lang=english), then BOTH could end up in Google’s index. Thus, Google Sitemaps won’t reduce the amount of duplication in Google’s index; in fact, I believe it will increase it.

    Duplication, on its own, may not sound like as much of a problem for webmasters as it is for Google itself, which has to dedicate additional resources to maintaining all this redundant content in its index. However, it does have serious implications for webmasters, because it results in PageRank dilution, where multiple versions of a page split up the “votes” (links) and PageRank score that a single version of the page would aggregate.

  • This brings me to the second, related problem with Google Sitemaps: it doesn’t do anything to alleviate the phenomenon of PageRank dilution. PageRank dilution results in lower PageRank, which in turn results in lower rankings. For example, consider that the above-mentioned Google Store’s product page (the “Black is Back T-Shirt”) is in Google’s index 5 times instead of just once. So each of those 5 variations earns only a fraction of the total potential PageRank score that it could have earned if all the links pointed to a single “Black is Back T-Shirt” page. Google Sitemaps needs to provide a way to convey, or to sync up with, the site’s hierarchical internal linking structure, so that it’s clear which pages should get how much of a share of the PageRank flowing into the site’s home page. Since the primary holder of PageRank score is the home page (that is, after all, the page that most everyone links to), it’s up to the site’s internal hierarchical linking structure to pass the PageRank of the home page to the rest of the site. As such, a page that is 2 clicks away from the home page will get a much larger share of PageRank score passed on to it from the home page, versus a page that is 5 clicks away from the home page.
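To illustrate the dilution effect, here is a toy calculation. It is a simplified textbook-style PageRank by power iteration, not Google’s actual algorithm, and the page names and link graph are made up: five external pages either all link to one product URL, or each links to a different duplicate URL of the same product.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration (a toy, not Google's algorithm).

    links maps each page to the list of pages it links to. Pages with no
    outgoing links spread their score evenly over every page.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Score held by pages that link nowhere is shared out evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {node: (1 - damping) / n + damping * dangling / n
                    for node in nodes}
        # Every page passes its score on in equal parts to its link targets.
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Five hypothetical external pages all link to one product URL...
single = pagerank({"ext%d" % i: ["shirt"] for i in range(5)})

# ...versus each linking to a different duplicate URL of the same product.
diluted = pagerank({"ext%d" % i: ["shirt?v=%d" % i] for i in range(5)})

best_duplicate = max(diluted["shirt?v=%d" % i] for i in range(5))
print("single URL:", round(single["shirt"], 3))
print("best duplicate URL:", round(best_duplicate, 3))
```

In this toy graph the single URL ends up with several times the score of any one duplicate, which is the whole point: the same links, split five ways, buy much less.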

Here’s how I suggest both of the above issues be rectified: by extending robots.txt with some additional directives that specify:

  • which parameter in a dynamic URL is the “key field”
  • which parameter is the product ID and which is the category ID (specifically for online catalogs)
  • which parameters are superfluous or don’t significantly vary the content displayed

Armed with this information, Googlebot will be able to not only eliminate duplicate pages but also intelligently choose the most appropriate version to save in its index and then associate with that page the PageRank of ALL versions of the page. The days of session IDs killing a site’s Google visibility would be over! Google admits in its Sitemaps FAQ that session IDs are still a problem even with the advent of Google Sitemaps:

Q: URLs on my site have session IDs in them. Do I need to remove them?

Yes. Including session IDs in URLs may result in incomplete and redundant crawling of your site.
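To make the robots.txt idea above a bit more concrete, here is a rough sketch of what a crawler could do once it knows which parameters are superfluous. No such robots.txt directives exist today; the `SUPERFLUOUS_PARAMS` set, the `canonicalize` helper, and the example.com URLs are all hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical per-site declaration of parameters that don't change content.
SUPERFLUOUS_PARAMS = {"sortby", "lang", "sessionid"}


def canonicalize(url, superfluous=SUPERFLUOUS_PARAMS):
    """Drop superfluous query parameters and sort the rest."""
    parts = urlsplit(url)
    kept = [(name, value)
            for name, value in parse_qsl(parts.query, keep_blank_values=True)
            if name.lower() not in superfluous]
    kept.sort()  # so that parameter order doesn't create new variants
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                       urlencode(kept), ""))


def group_duplicates(urls):
    """Map each canonical URL to the list of crawled variants behind it."""
    groups = {}
    for url in urls:
        groups.setdefault(canonicalize(url), []).append(url)
    return groups


crawled = [
    "http://www.example.com/product?id=42",
    "http://www.example.com/product?id=42&sortby=default",
    "http://www.example.com/product?lang=english&id=42&sessionid=abc123",
]
groups = group_duplicates(crawled)
for canonical, variants in groups.items():
    print(canonical, "<-", len(variants), "variants")
```

With the variants grouped like this, the engine could index one version and credit it with the links pointing at all of them.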

Remember, getting indexed only gets you to the party, it doesn’t mean you’re going to be popular at the party. Google Sitemaps may help you get more pages indexed, but if those pages all have a PageRank score of 0, then what was the point? It’ll be like sitting along the wall the whole time with no one asking you to dance!

GravityStream, our SEO proxy technology (the concept of SEO proxies is explained in my article in Catalog Age last October) deals with PageRank dilution by distilling URLs in links into their lowest common denominator and replacing them on the proxy. We’ve found that, even as Googlebot gets more aggressive at spidering dynamic sites with complex URLs and starts indexing one of our clients’ sites more fully, our proxy still has a major leg-up on the native site that it’s proxying. For example, our GravityStream proxy of PETsMART.com is #1 in Google for “best pet toys”, and yet the corresponding page on the PETsMART.com native site is nowhere in the first 10 pages of results even though it is indexed. Until Google extends Google Sitemaps to deal with PageRank dilution, I’d expect that a GravityStream proxy will still trump a native site, even if it’s using Google Sitemaps. That means that currently, despite Google Sitemaps, GravityStream still plays an important role for online retailers. Nonetheless, it’s my sincere hope that Google takes my feedback on board and reworks their protocol!

Unethical SEO vendors – can you spot em?

May 30th, 2005

by Stephan Spencer

You can’t just ask a Search Engine Optimization vendor if they are ethical. Of course they will say “yes.” So if you are shopping for some SEO help, how do you screen out the baddies?

A while back I blogged about how to be objective with your SEO vendor selection, but I didn’t specifically cover how to screen out the unethical ones. I will do that now.

First off, interview the vendor extensively. Get them to explain the techniques they will be using. A “yes” from them to any of the following questions is a warning sign:

  • Do your techniques involve any kind of deception?
  • Do you use proprietary techniques?
  • Do you use doorway pages or anything similar?
  • Do you do deceptive redirects?
  • Have you ever had sites banned?
  • Do you offer rank guarantees? (You can’t guarantee something you have no control over. The only way you can get a guaranteed rank is through pay-per-click.)
  • Do you send email to prospects with whom you do not have a prior existing business relationship, or without getting permission from those prospects in advance? (If so, that’s spam! Never do business with a spammer.)

During your discussions with the vendor, if they describe their SEO tactics as short-term, you might want to reconsider. SEO, when done right (i.e. when following “best practices”), has long-term sustainable impact: for years, in fact. For proof, just read this.

After you’re done quizzing the vendor, talk to their clients. Ask those clients:

  • Does your SEO vendor teach you how to fish, or do they always do the fishing for you?
  • Have your traffic and sales gone up a lot because of the vendor? If so, do you believe the increase to be sustainable?
  • How long have you worked with the vendor? How long do you plan to continue working with them? Any idea what the vendor’s client churn rate is?

Then you’ll need to do some of your own investigating. Check the HTML code on their clients’ sites for hidden text, hidden links, and so forth. Also examine what their clients’ websites are serving to the search engines. There are a couple of different ways to view a website through the eyes of a search engine spider: one is through a Firefox browser extension called User Agent Switcher; the other is through the cached version of the page that was indexed by the engine, available from the Cached link in the search results. Compare the page meant for the search engines with the corresponding page on the native website as seen by a normal visitor. If the content served up to the search engines is completely different from what is served up to visitors, then they are spamming. Things to look for when making your comparison: whether the title tag is significantly different, and whether keywords have been stuffed into the body copy, the meta tags, and other parts of the page to help the version shown to search engines rank better. Finally, search the online forums and SEO directories like SEOPros.com and SEOConsultants.com with Google for complaints about the vendor.
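If you’d rather script the spider-versus-visitor comparison than click through it by hand, something like the following sketch would do. The User-Agent strings, the helper names, and the crude word-overlap measure are my own assumptions, and it only catches cloaking keyed on the User-Agent header; a site that cloaks by IP address will serve this script the normal page.

```python
import re
import urllib.request

# Assumed User-Agent strings for illustration; adjust as needed.
SPIDER_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"
VISITOR_UA = "Mozilla/5.0 (compatible; example-browser)"


def fetch(url, user_agent):
    """Download a page while presenting the given User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_title(html):
    """Return the contents of the title tag with whitespace collapsed."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    return " ".join(match.group(1).split()) if match else ""


def visible_words(html):
    """Rough set of words on the page once scripts, styles, and tags are gone."""
    text = re.sub(r"<(script|style)\b.*?</\1>", " ", html,
                  flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r"<[^>]+>", " ", text)
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def compare_versions(spider_html, visitor_html):
    """Report whether the titles differ and how much vocabulary is shared."""
    spider_words = visible_words(spider_html)
    visitor_words = visible_words(visitor_html)
    union = spider_words | visitor_words
    overlap = len(spider_words & visitor_words) / len(union) if union else 1.0
    return {
        "title_differs": extract_title(spider_html) != extract_title(visitor_html),
        "word_overlap": overlap,
    }


# Usage against a live site (not run here):
# report = compare_versions(fetch(url, SPIDER_UA), fetch(url, VISITOR_UA))
```

A different title or a low word overlap isn’t proof of spamming on its own, but it tells you which pages deserve a closer look by eye.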

Got any horror stories or lessons learned to share from dealing with a less-than-stellar SEO vendor? Post a comment.

Podcasting and SEO: How to SEO your podcasts

April 17th, 2005

by Stephan Spencer

There has been plenty of discussion in the blogosphere about blogs and search engine optimization (SEO). Google in particular seems to love blogs. Blogs are rich in content, heavily linked, with links that tend to be contextual, and without much in the way of code bloat or gratuitous flash animation. In short, blogs are search engine friendly out-of-the-box.

But what about SEO’ing a podcast, the blog’s newest cousin?

Podcasting (where anyone can become an Internet radio talk show host or DJ) presents unique opportunities to the marketer/content producer that blogging does not. I expound on this a bit more in my recent MarketingProfs article but the benefits of podcasting from an SEO standpoint wouldn’t seem as obvious. Podcasts are usually audio content, so you don’t get all this rich textual content that the search engine spiders can snarf up. You also don’t get the rich inter-linking that happens with blogs because you can’t embed clickable URLs throughout your MP3 files.

Nonetheless, I believe you can SEO your podcasts. Here’s how:

  1. Come up with a name for your podcast show that is rich with relevant heavily searched-on keywords.
  2. Make sure your MP3 files have really good ID3 tags, rich with relevant keywords. ID3v2 even supports comment and URL fields. The major search engines may not pick up the ID3 tags now, but they will! And besides, there are specialty engines and software tools that already do.
  3. Synopsize each podcast show in text and blog that. Put your most important keywords as high up in the blog post as possible but still keep it readable and interesting.
  4. Encourage those who link directly to your MP3 file to also link to your blog post about the podcast.
  5. Consider using a transcription service to transcribe your podcast or at least excerpts of it for use as search engine fodder. Break the transcript up into sections. Make sure each section is on a separate web page and each separate web page has a great keyword-rich title relating to that segment of the podcast. And, of course, link to the podcast MP3 from those web pages. There are many transcription services out there, where you can just email them the MP3 file or give them a URL and they send you back a Word document. Here’s a partial list of transcription services.
  6. Submit your podcast site to podcast directories and search engines such as audio.weblogs.com.
  7. Let people in your industry, such as bloggers and the media, know that you have a podcast because podcasting is quite new and novel. It will be more newsworthy and linkworthy than just another blog in your industry.
  8. Don’t just get up on your soapbox. Have conversations with others, in the form of recorded phone interviews, and podcast those as well. Pick people who have great reputations on the web and great PageRank scores, and ask that they link to your site and to your podcast summary page.

This isn’t meant to be a comprehensive list of tactics. It is simply meant as a catalyst for creative thinking. SEO, in particular the link building aspect, isn’t about just following a set list of formulae. It is about creatively thinking outside the box and differentiating yourself in ways that make your site eminently more linkworthy than your competitors.

RSS and SEO: Implications for Search Marketers

March 2nd, 2005

by Stephan Spencer

Hello from Search Engine Strategies in NYC. Yesterday I spoke at the Webfeeds, Blogs, and Search session. My talk was focused on implementing RSS feeds as part of your search engine marketing strategy. I’ve made my Powerpoint deck available online at www.netconcepts.com/learn/rss.ppt.

A lot of people mistakenly lump blogs and RSS together, but RSS has infinitely more applications beyond just blogs! For example: news alerts, latest specials, clearance items, upcoming events, new stock arrivals, new articles, new tools & resources, search results, a book’s revision history, top 10 best sellers (like Amazon.com does in many of its product categories), project management activities, forum/listserve posts, recently added downloads, etc.

There are some important tracking and measurement issues to consider when implementing RSS:

  • You should be tracking reads by embedding a uniquely-named 1-pixel gif within the <content:encoded> container. This is known as a “web bug.” Email marketers have been using web bugs to track open rates for ages.
  • You should be tracking clickthroughs by replacing all URLs in the <link> containers with clicktracked URLs. You can code this in-house, or you could use a hosted ASP service like SimpleFeed to do this for you. (Incidentally, Feedburner offers imprecise counts based on users’ IP addresses, not on clicktracked URLs.)
  • You should be tracking circulation (# of subscribers). Again, you could use a service like SimpleFeed or Feedburner, which categorizes visiting user-agents into bots, browsers, aggregators, and clients. Bots and browsers don’t generally “count” as subscribers, while a single hit from an aggregator may represent a number of readers. This number is usually revealed within the User-Agent in the server logs, for example Bloglines/2.0 (…; xx subscribers). Today, tracking readership from clients is an inexact science. Hopefully in the future, RSS newsreader software will generate a hashcode from the subscriber’s email address and this hashcode would then get passed in the User-Agent on every HTTP request for the RSS feed.
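As a rough illustration of the circulation point, here is a sketch that pulls subscriber counts out of User-Agent strings shaped like the Bloglines example above. The log entries and the `estimate_circulation` helper are made up, and real aggregators vary in how (and whether) they report a count.

```python
import re

# Matches the "NN subscribers" fragment some aggregators put in their User-Agent.
SUBSCRIBER_COUNT = re.compile(r"(\d+)\s+subscribers?", re.IGNORECASE)


def estimate_circulation(user_agents):
    """Return {aggregator_name: subscriber_count} from User-Agent strings.

    An aggregator polls the feed many times a day, so only the highest
    count reported under each name is kept rather than summing every hit.
    """
    counts = {}
    for user_agent in user_agents:
        match = SUBSCRIBER_COUNT.search(user_agent)
        if not match:
            continue  # bot, browser, or desktop client: no count to read
        name = user_agent.split("/", 1)[0].strip()
        counts[name] = max(counts.get(name, 0), int(match.group(1)))
    return counts


# Made-up log entries for illustration only
sample_log = [
    "Bloglines/2.0 (http://www.bloglines.com; 120 subscribers)",
    "Bloglines/2.0 (http://www.bloglines.com; 118 subscribers)",
    "ExampleAggregator/1.3 (15 subscribers)",
    "Mozilla/5.0 (compatible; SomeBrowser)",
]
circulation = estimate_circulation(sample_log)
print(circulation, "total:", sum(circulation.values()))
```

Keeping the highest count per aggregator is a judgment call; you could just as well keep the most recent one. Either way, desktop clients remain uncounted, which is why the total is a floor rather than a true readership figure.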

I consider personalized RSS feeds to be “best practice.” So far I’m not seeing much in the way of personalization within RSS feeds, but that will come, I’m sure. It has to. Having only one generic RSS feed per site is a one-size-fits-all approach that can’t scale. On the other hand, having too many feeds to choose from on a site can overwhelm the user. So how about instead you offer a single RSS feed, but it’s one where the content is personalized to the interests of the individual subscriber. Yet if the feed is being syndicated onto public websites, you’ll want to discover that (by checking the referrers in your server logs) and then make sure the RSS feed content is quite consistent from syndicated site to syndicated site so that these sites all reinforce the search engine juice of the same pages with similar link text. Or simply ask the subscriber his/her intentions (personal reading or syndication on a public website) as part of the personalization/subscription signup process.

IMPORTANT: An oft overlooked area of RSS click tracking is how to pass on the search engine juice from the syndicating sites to your destination site. Use clicktracked URLs with query string parameters kept to a minimum, then 301 redirect not 302. This is important! 302 redirects, also known as temporary redirects, can hang up the search engine juice. Search engines recommend you use 301 redirects, also known as permanent redirects. Surprisingly, Feedburner and Simplefeed both use 302 redirects. Tsk tsk!
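Here is a minimal sketch of an in-house click tracker that answers with a 301 rather than a 302, using only Python’s standard library. The /c?i=… URL shape, the `DESTINATIONS` table, and the example.com address are hypothetical; a real tracker would also record the click somewhere before redirecting.

```python
from http.server import BaseHTTPRequestHandler
from urllib.parse import urlsplit, parse_qs

# Hypothetical lookup from a short item ID to the real destination URL.
DESTINATIONS = {
    "1001": "http://www.example.com/articles/rss-and-seo",
}


def resolve_click(path):
    """Return (status, headers) for a clicktracked path such as /c?i=1001."""
    query = parse_qs(urlsplit(path).query)
    item_id = query.get("i", [""])[0]
    destination = DESTINATIONS.get(item_id)
    if destination is None:
        return 404, {}
    # 301 = permanent redirect; a 302 here would be the temporary kind.
    return 301, {"Location": destination}


class ClickTracker(BaseHTTPRequestHandler):
    def do_GET(self):
        status, headers = resolve_click(self.path)
        # A real tracker would log the click here before responding.
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()


# To try it locally:
#   from http.server import HTTPServer
#   HTTPServer(("localhost", 8000), ClickTracker).serve_forever()
print(resolve_click("/c?i=1001"))
```

Note the single short parameter in the tracked URL, in keeping with the advice to keep query strings to a minimum.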

Sites using your feeds for themed content to add to their site for SEO purposes could strip out your links or cut off the flow of the search engine juice using the nofollow rel attribute or by removing the hrefs altogether. Scan for that and then cut off any offenders’ feed access.
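Scanning for offenders can be automated along these lines. This is a sketch under my own assumptions: the `audit_syndicated_page` helper, its three verdict labels, and the sample markup are made up, and it only looks at anchor tags that point at your own host.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkAudit(HTMLParser):
    """Counts links to my_host, split by whether they carry rel=nofollow."""

    def __init__(self, my_host):
        super().__init__()
        self.my_host = my_host.lower()
        self.followed = 0
        self.nofollowed = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href") or ""
        if urlsplit(href).netloc.lower() != self.my_host:
            return  # not a link back to my site
        rel_values = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_values:
            self.nofollowed += 1
        else:
            self.followed += 1


def audit_syndicated_page(html, my_host):
    """Return "stripped", "nofollow", or "ok" for a page that uses my feed."""
    parser = LinkAudit(my_host)
    parser.feed(html)
    if parser.followed == 0 and parser.nofollowed == 0:
        return "stripped"
    if parser.nofollowed > 0:
        return "nofollow"
    return "ok"


good = '<p><a href="http://www.example.com/post-1">Post 1</a></p>'
bad = '<p><a rel="nofollow" href="http://www.example.com/post-1">Post 1</a></p>'
gone = "<p>Post 1</p>"
for page in (good, bad, gone):
    print(audit_syndicated_page(page, "www.example.com"))
```

Run it over the pages that show up as referrers for your feed, and anything that comes back as stripped or nofollow is a candidate for having its feed access cut off.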

Some more “gotchas” if you don’t set things up right:

  • You should own your feed URL (unless you want to be forever tied to Feedburner or whatever RSS hosting service you are using). Remember the days long ago when people put their earthlink.net email addresses on their business cards? Don’t repeat that mistake with RSS feeds.
  • You need to proactively ensure your listings in the Yahoo SERPs display the “Add to My Yahoo!” link; don’t just assume it will happen. To do this, subscribe to your feed from your own My Yahoo! page (so you know you have at least one My Yahoo! subscriber), then set up your blog to automatically “ping” Yahoo! every time you post a new blog entry. (I recommend using Pingomatic.com to do this because then it will also ping Technorati etc. for you, all in one fell swoop, every time you make an update to your blog.)
  • Configure your website to allow subscribers to subscribe easily using your home page address if they don’t know your RSS feed address. That means putting <link> tags in your HTML. For example:
    <link rel="alternate" type="application/rss+xml" title="RSS" href="http://www.stephanspencer.com/index.rdf" />
    Also add buttons to your web pages for 1-click adding to the most popular RSS newsreaders / aggregators, such as: “Subscribe in NewsGator,” “Subscribe on Bloglines,” and “Add to My Yahoo!”

RSS is great for link building. Any SEO worth his/her salt should be making use of RSS as part of a link building strategy, or at least making plans to use it soon. In addition to RSS, there are some other effective blog-related link building strategies, like:

  • Getting onto bloggers’ “blogrolls” (the list of their favorite blogs that they post on their site for all to see)
  • Getting links through “trackbacks” (excerpts of your blog posts that appear on other bloggers’ blog entries in a way that you initiate rather than them)

An objective approach to choosing an SEO vendor

January 10th, 2005

by Stephan Spencer

In the midst of choosing an SEO vendor to advise or implement search engine optimization for you? Don’t base your decision on just a ‘gut feel’. Effectively separating the wheat from the chaff requires that objective rather than subjective criteria be used. These include:

  1. PageRank scores
    Review PageRank scores of your candidate SEO firms’ home pages and their clients’ home pages. PageRank is Google’s scoring system for importance; it’s logarithmic like a Richter scale. Check PageRanks with the Google Toolbar. If you don’t have the Google toolbar installed on your browser, it’s probably easier just to use the free service at http://www.seochat.com/seo-tools/pagerank-lookup/. Probably more enlightening however is to use the Google Directory to check PageRanks, because then you can see where they sit in comparison to a bunch of competitors in that same category, since the sites on each category page are listed in order of PageRank score. To do so, go to http://directory.google.com and type in the name of the business into the search box (e.g. “Netconcepts”), then when you find its listing in the search results, click on the category name (e.g. “Computers > Internet > … > Designers > Full Service > N”). Look for that company’s listing on that category page. Hopefully it’s near the top, and hopefully the little green bar in the left column is more green than gray.
  2. Rankings
    Get a list of keywords from the SEO firm that they consider important to their business. Get a list of keywords from them that are important to their clients too. Check where they rank in Google for those keywords. If you have time, check rankings in Yahoo too (Yahoo has 32% market share, Google has 45%). Then, and here’s the important bit: check how popular those keywords are with searchers, using the Overture Search Term Suggestion Tool at http://inventory.overture.com (or better yet, on WordTracker.com if you have a paid subscription to it). If the keyword is searched on infrequently, then a high ranking for that keyword is not so impressive.
  3. Evidence of thought leadership
    Everyone claims to be a thought leader. A true thought leader, however, demonstrates this through such things as:

    • known reputation in that topic area by other thought leaders you know and trust
    • number of published articles written in that topic area
    • the caliber of those articles
    • number of conference presentations given in that topic area
    • the caliber of those presentations
    • number of books written that adequately cover that topic area
    • the caliber of those books
    • the extent to which they are quoted in the media in that topic area
    • a well-read, well-linked, and oft-quoted blog (web log)

Google bug reveals favored web sites

January 9th, 2005

by Stephan Spencer

A couple months ago I shared one of my Google secrets, since that secret no longer worked. ;-) Specifically, it was how to obtain a list of the most important web sites according to Google.

Now, surprisingly, this little trick appears to work again (it stopped working in 2003), thanks to a bug introduced into Google’s algorithm. Two months ago, a search for http would have revealed results like HTTP – Hypertext Transfer Protocol Overview and Welcome! – The Apache HTTP Server Project. Today, these sites appear nowhere near the top of the results. Instead, the top results are occupied by a “who’s who” list of highly important web sites — sites that don’t include the word http anywhere in the text of the page.

As already noted by blogger Nathan Weinberg, this same phenomenon occurs when you search for www.

One thing I found curious is that http and www Google queries return different results. Now these results are NOT in order of PageRank score, at least not the PageRank scores as revealed by the Google Toolbar. You can verify this to be the case yourself simply by using SEO Chat’s PageRank Search tool. Indeed, it’s a well-known fact within the SEO community that the PageRank scores served up by the Google Toolbar servers are not the actual PageRanks used by Google in the ranking algorithm. PageRank debate aside, perhaps this list offers us a (now) rare glimpse at some of Google’s Chosen Ones — the most important sites on the Internet according to Google.

What makes me say this is due to a bug in Google? For one thing, these results are NOT relevant to the search query. Secondly, I’ve uncovered another bug newly introduced into Google’s algorithm, namely that the inurl: query operator does not work properly, and I think these two bugs might be related. For an example of this second bug in action, search Google for site:blogs.msdn.com scoble inurl:msnsearch and the top search result is currently blogs.msdn.com/mikehall/archive/2004/11/10/255417.aspx. Note there’s no msnsearch in that URL!

I’ve compiled a list of the top 1000 results for each of the two queries for your convenience. You’ll see that they do vary quite dramatically:


Yahoo! & Google’s overlapping results fewer than you’d think

August 29th, 2004

by Stephan Spencer

There’s a brand new meta search engine on the block called Jux2. Its premise is to find the overlap between the top 10 results across two major search engines. So far I’m really impressed with it. It even has a toolbar for Mozilla Firefox.

Jux2 conducted some tests to determine just how much overlap there is in the top search results on Google versus Yahoo! The results of their tests are very interesting. Such as:

  • Analysis of Google and Yahoo! search results on the 500 most popular search terms found that, on average, Google and Yahoo! shared only 3.8 of their top 10 results. Furthermore, 30% of the search terms had 2 or fewer overlapping results, and only 17% had 6 or more overlapping results among the top 10.
  • The overlapping set of top 10 results between Google and Ask Jeeves was even smaller: 3.4 out of 10. And between Yahoo! and Ask Jeeves, smaller yet: 3.1 out of 10.
  • Analysis of 91 random searches on Google and Yahoo! found that the two engines share only 23% of their top 100 results. Furthermore, only 4.8 of Google’s top 10 results even made Yahoo’s top 100. And only 5.4 of Yahoo’s top 10 made Google’s top 100.
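For anyone wanting to run a similar comparison on their own keywords, the arithmetic is simple. The sketch below is my own guess at the kind of calculation involved, not Jux2’s actual methodology; the function names and sample result lists are made up, and it treats two results as the same only if the URLs match exactly.

```python
def top_n_overlap(results_a, results_b, n=10):
    """Count the URLs that appear in the top n of both result lists."""
    return len(set(results_a[:n]) & set(results_b[:n]))


def average_overlap(paired_results, n=10):
    """Average top-n overlap across (engine_a, engine_b) result pairs."""
    overlaps = [top_n_overlap(a, b, n) for a, b in paired_results]
    return sum(overlaps) / len(overlaps)


# Made-up results for two queries on two engines (top 3 only, for brevity)
query_1 = (["a.com", "b.com", "c.com"], ["b.com", "d.com", "a.com"])
query_2 = (["e.com", "f.com", "g.com"], ["h.com", "e.com", "i.com"])
print(average_overlap([query_1, query_2], n=3))
```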

For me, Jux2’s findings were a good reminder that the algorithms of the major search engines are markedly different, more so than one might imagine. So a metasearch engine that compares and contrasts two partially overlapping sets of search results makes a lot of sense. I think I’ll try Jux2 for a while and report back on my experiences.

Spiders like Googlebot choke on Session IDs

June 25th, 2004

by Stephan Spencer

Many ecommerce sites have session IDs or user IDs in the URLs of their pages. This tends to cause the pages either not to get indexed by search engines like Google, or to get included over and over, clogging up the index with duplicates (this phenomenon is called a “spider trap”). Furthermore, having all these duplicates in the index causes the site’s importance score, known as PageRank, to be spread out across all these duplicates (this phenomenon is called “PageRank dilution”).

Ironically, Googlebot regularly gets caught in a spider trap while spidering one of its own sites – the Google Store (where they sell branded caps, shirts, umbrellas, etc.). The URLs of the store are not very search engine friendly: they are overly complex and include session IDs. This has resulted in 3,440 duplicate copies of the Accessories page and 3,420 copies of the Office page, for example.

If you have a dynamic, database-driven website and you want to avoid your own site becoming a spider trap, you’ll need to keep your URLs simple. Try to avoid having any ?, &, or = characters in the URLs. And try to keep the number of “parameters” to a minimum. With URLs and search engine friendliness, less is more.
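A quick way to audit your own URLs against this advice is sketched below. The list of session parameter names and the two-parameter threshold are assumptions of mine, not published search engine rules, so tune both to your own site.

```python
from urllib.parse import urlsplit, parse_qsl

# Assumed parameter names that commonly carry session or user IDs.
SESSION_PARAMS = {"sid", "sessionid", "session_id", "phpsessid",
                  "jsessionid", "userid"}


def url_warnings(url, max_params=2):
    """Return a list of reasons a URL may be unfriendly to spiders."""
    params = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    warnings = []
    if any(name.lower() in SESSION_PARAMS for name, _ in params):
        warnings.append("session or user ID in query string")
    if len(params) > max_params:
        warnings.append("more than %d parameters" % max_params)
    return warnings


# Hypothetical URLs for illustration
for url in (
    "http://www.example.com/store/accessories",
    "http://www.example.com/store?cat=5&sessionid=ab12&sort=asc",
):
    print(url, url_warnings(url))
```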

