
If you haven’t already implemented the +1 button on your site, we’re excited for your first experience to be a fast one. This is a great opportunity to allow your users to recommend your site to their friends, potentially bringing in more qualified traffic from Google search. To those who already have the button, we hope that you enjoy the improvements in speed. Our team will continue to work hard to enhance the +1 button experience, as we know that “fast is better than slow” is as true today as it’s ever been.

If you have any questions, please join us in the Webmaster forum. To receive updates about the +1 button, please subscribe to the Google Publisher Buttons Announce Group. For advanced tips and tricks, check our Google Code site.

Once upon a time there was an online store, fairyclothes.example.com. The store’s website used parameters in its URLs, and the same content could be reached through multiple URLs. One day the store owner noticed that too many redundant URLs could be preventing Googlebot from crawling the site thoroughly, so he sent his assistant CuriousQuestionAsker to the Great WebWizard to get advice on using the URL parameters feature to reduce the duplicate content crawled by Googlebot. The Great WebWizard was famous for his wisdom. He looked at the URL parameters and proposed the following configuration:

Parameter name | Effect on content? | What should Googlebot crawl?
trackingId     | None               | One representative URL
sortOrder      | Sorts              | Only URLs with value = ‘lowToHigh’
sortBy         | Sorts              | Only URLs with value = ‘price’
filterByColor  | Narrows            | No URLs
itemId         | Specifies          | Every URL
page           | Paginates          | Every URL
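
If it helps to see the wizard’s advice in one place, here is a minimal sketch in Python of the same configuration expressed as a data structure. It is purely illustrative; the real settings are made in the Webmaster Tools user interface, and the field names below are our own invention.

```python
# Hypothetical representation of the wizard's configuration: each URL parameter
# is mapped to its effect on page content and the crawl directive chosen for it.
URL_PARAMETER_CONFIG = {
    "trackingId":    {"effect": "None",      "crawl": "One representative URL"},
    "sortOrder":     {"effect": "Sorts",     "crawl": "Only URLs with value = 'lowToHigh'"},
    "sortBy":        {"effect": "Sorts",     "crawl": "Only URLs with value = 'price'"},
    "filterByColor": {"effect": "Narrows",   "crawl": "No URLs"},
    "itemId":        {"effect": "Specifies", "crawl": "Every URL"},
    "page":          {"effect": "Paginates", "crawl": "Every URL"},
}
```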

The CuriousQuestionAsker couldn’t avoid his nature and started asking questions:

CuriousQuestionAsker: You’ve instructed Googlebot to choose a representative URL for trackingId (value to be chosen by Googlebot). Why not select the Only URLs with value=x option and choose the value myself?
Great WebWizard: While crawling the web Googlebot encountered the following URLs that link to your site:
  1. fairyclothes.example.com/skirts/?trackingId=aaa123
  2. fairyclothes.example.com/skirts/?trackingId=aaa124
  3. fairyclothes.example.com/trousers/?trackingId=aaa125
Imagine that you were to tell Googlebot to only crawl URLs where “trackingId=aaa125”. In that case Googlebot would not crawl URLs 1 and 2, as neither of them has the value aaa125 for trackingId. Their content would neither be crawled nor indexed, and none of your inventory of fine skirts would show up in Google’s search results. No, for this case choosing a representative URL is the way to go. Why? Because that tells Googlebot that when it encounters two URLs on the web that differ only in this parameter (as URLs 1 and 2 above do), it only needs to crawl one of them (either will do) and it will still get all the content. In the example above two URLs will be crawled: either 1 & 3, or 2 & 3. Not a single skirt or trouser will be lost.
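
To make the “one representative URL” behaviour concrete, here is a small Python sketch of the idea, written by us for illustration only (it is not how Googlebot actually works). It groups URLs that differ only in trackingId and keeps a single URL from each group:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def crawl_key(url, representative_params=("trackingId",)):
    """Build a key that ignores parameters configured as 'one representative URL'.

    URLs that differ only in those parameters map to the same key, so a crawler
    would only need to fetch one of them to see all the content.
    """
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in representative_params]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

urls = [
    "http://fairyclothes.example.com/skirts/?trackingId=aaa123",
    "http://fairyclothes.example.com/skirts/?trackingId=aaa124",
    "http://fairyclothes.example.com/trousers/?trackingId=aaa125",
]

representatives = {}
for url in urls:
    representatives.setdefault(crawl_key(url), url)  # keep the first URL seen per group

print(list(representatives.values()))
# ['http://fairyclothes.example.com/skirts/?trackingId=aaa123',
#  'http://fairyclothes.example.com/trousers/?trackingId=aaa125']
```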

CuriousQuestionAsker: What about the sortOrder parameter? I don’t care if the items are listed in ascending or descending order. Why not let Google select a representative value?
Great WebWizard: As Googlebot continues to crawl it may find the following URLs:
  1. fairyclothes.example.com/skirts/?page=1&sortBy=price&sortOrder=’lowToHigh’
  2. fairyclothes.example.com/skirts/?page=1&sortBy=price&sortOrder=’highToLow’
  3. fairyclothes.example.com/skirts/?page=2&sortBy=price&sortOrder=’lowToHigh’
  4. fairyclothes.example.com/skirts/?page=2&sortBy=price&sortOrder=’highToLow’
Notice how the first pair of URLs (1 & 2) differs only in the value of the sortOrder parameter, as do the URLs in the second pair (3 & 4). However, URLs 1 and 2 produce different content: the first shows the least expensive of your skirts, the second the priciest. That should be your first hint that a single representative value is not a good choice for this situation. Moreover, if you let Googlebot choose a single representative from among a set of URLs that differ only in their sortOrder parameter, it might choose a different value each time. In the example above, from the first pair of URLs, URL 1 might be chosen (sortOrder=’lowToHigh’), whereas from the second pair URL 4 might be picked (sortOrder=’highToLow’). If that were to happen, Googlebot would crawl only the least expensive skirts (twice): in a two-page list, page 1 sorted from low to high and page 2 sorted from high to low both show the cheapest skirts.
Your most expensive skirts would not be crawled at all! When dealing with sorting parameters consistency is key: always sort the same way.
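
If you want to see the arithmetic behind the wizard’s warning, here is a tiny Python sketch with a made-up two-page catalogue. Crawling page 1 sorted low-to-high and page 2 sorted high-to-low fetches the cheapest skirts twice and the most expensive ones never:

```python
# Made-up inventory: four skirt prices, two items per page.
SKIRTS = [10, 20, 30, 40]
PAGE_SIZE = 2

def page_of(prices, page, sort_order):
    """Return the prices shown on a given page for a given sort order."""
    ordered = sorted(prices, reverse=(sort_order == "highToLow"))
    start = (page - 1) * PAGE_SIZE
    return ordered[start:start + PAGE_SIZE]

# Inconsistent representative choices: lowToHigh for page 1, highToLow for page 2.
crawled = set(page_of(SKIRTS, 1, "lowToHigh")) | set(page_of(SKIRTS, 2, "highToLow"))
print(sorted(crawled))                # [10, 20] -- only the cheapest skirts, seen twice
print(sorted(set(SKIRTS) - crawled))  # [30, 40] -- the priciest skirts are never crawled
```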

CuriousQuestionAsker: How about the sortBy value?
Great WebWizard: This is very similar to the sortOrder parameter. You want the crawled URLs of your listings to be sorted consistently across all pages, otherwise some of the items may not be visible to Googlebot. However, you should be careful which value you choose. If you sell books as well as shoes in your store, it would be better not to select the value ‘title’: URLs pointing to shoes never contain ‘sortBy=title’, so they would not be crawled. Likewise, setting ‘sortBy=size’ works well for crawling shoes, but not for crawling books. Keep in mind that the parameter configuration applies to the whole site.
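
As a rough illustration of that site-wide effect (the URLs below are made up, and again this is not the real crawler), telling Googlebot to crawl only URLs where sortBy=title would silently drop every shoe URL, because those URLs never carry that value:

```python
urls = [
    "http://fairyclothes.example.com/books/?sortBy=title&page=1",
    "http://fairyclothes.example.com/books/?sortBy=title&page=2",
    "http://fairyclothes.example.com/shoes/?sortBy=size&page=1",
    "http://fairyclothes.example.com/shoes/?sortBy=size&page=2",
]

# "Only URLs with value = 'title'" applied to sortBy across the whole site:
crawlable = [u for u in urls if "sortBy=title" in u]
print(crawlable)  # only the book URLs survive; no shoe page would ever be crawled
```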

CuriousQuestionAsker: Why not crawl URLs with parameter filterByColor?
Great WebWizard: Imagine that you have a three-page list of skirts. Some of the skirts are blue, some of them are red and others are green.
This list is filterable. When a user filters by a color, say blue, she gets two pages containing only the blue skirts.
They seem like new pages (the set of items is different from every other page), but there is actually no new content on them, since all the blue skirts were already included in the original three pages. There’s no need to crawl URLs that narrow the content by color, since the content served on those URLs has already been crawled. There is one important thing to notice here: before you disallow some URLs from being crawled by selecting the “No URLs” option, make sure that Googlebot can access the content in another way. In our example, Googlebot needs to be able to reach the original three unfiltered pages through links on your site, and there should be no settings that prevent crawling them.
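
To see why the filtered pages add nothing new, here is a small Python sketch with a made-up inventory. Everything reachable through filterByColor=blue is a subset of what the three unfiltered pages already serve:

```python
# Made-up inventory: (item, color) pairs spread over three unfiltered pages.
PAGES = {
    1: [("skirt-1", "blue"), ("skirt-2", "red"), ("skirt-3", "green")],
    2: [("skirt-4", "blue"), ("skirt-5", "blue"), ("skirt-6", "red")],
    3: [("skirt-7", "green"), ("skirt-8", "blue")],
}

all_items = {item for page in PAGES.values() for item, _ in page}
blue_items = {item for page in PAGES.values() for item, color in page if color == "blue"}

# The filtered URLs only re-serve items Googlebot can already reach via the
# unfiltered pages, so crawling them would add no new content.
print(blue_items <= all_items)  # True
```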
- - -

If your site has URL parameters that are potentially creating duplicate content issues, check out the new URL Parameters feature in Webmaster Tools. Let us know what you think, or if you have any questions, post them to the Webmaster Help Forum.




On average, pages produced by the Webmaster Team now contain about three validation issues per page (because we combine HTML and CSS validation in the scoring process, information about each issue’s origin is lost), down from about four issues per page two years ago.
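
For readers curious how such a combined score could be computed, here is a minimal sketch with invented per-page counts; the numbers are ours, not the team’s actual data, and the real process relies on the W3C validators rather than hand-entered figures:

```python
# Invented sample: HTML and CSS issue counts per audited page.
issues_per_page = [
    {"html": 2, "css": 1},
    {"html": 0, "css": 3},
    {"html": 4, "css": 1},
    {"html": 1, "css": 0},
]

# Because HTML and CSS issues are summed per page, the origin of each issue
# is no longer visible in the final score.
combined = [p["html"] + p["css"] for p in issues_per_page]
print(sum(combined) / len(combined))  # 3.0 issues per page for this made-up sample
```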

This information is valuable for us as it tells us how close we are to our goal of always shipping perfectly valid code, and it also tells us whether we’re on track or not. As you can see, with the exception of the 2nd quarter of 2009 and the 1st quarter of 2010, we are generally observing a positive trend.

What has to be kept in mind are issues with the integrity of the data, i.e. the sample size as well as “false positives” in the validators. We’re working with the W3C in several ways, including reporting and helping to fix issues in the validators; however, as software can never be perfect, sometimes pages get dinged for non-issues: see for example the border-radius issue that has recently been fixed. We know that this is negatively affecting the validation scores we’re determining, but we have no data yet to indicate how much.

Although we track more than just validation for quality control purposes, validation plays an important role in measuring the health of Google’s informational websites.

How do you use validation in your development process?
