Sunday, November 28, 2010

Google Web Preview - A Bad Bot

Google's new search feature, which allows users to preview your site before visiting it, may mean that you have seen many instances of this Google user agent in your log files:

USERAGENT: Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; Google Web Preview) Version/3.1 Safari/525.13

Google says that previews are generated during normal crawls. In addition, the Web Preview user agent fetches your page content live, to show to people browsing the search results, if no cached preview is saved.

Google Web Preview and robots.txt

Google Web Preview is a Google user agent which does not respect or even read a robots.txt file. It does as it wishes without reference to your robots.txt file because a preview is a user-initiated function: something requested by people browsing the search results.

Google previews are also generated by normal Google crawls. So, when a searcher looks for your preview, they will be served either a cached one from a past crawl or a fresh one generated by Google Web Preview, which is then also cached for future results.

This means that there are two methods of generating previews for your site. That adds to the confusion when you try to diagnose any problems, as you will not know which method generated the preview you are seeing.

Google Web Preview and Cloaking

Cloaking refers to the practice of presenting different content or URLs to users and search engines.
Google Web Preview is not a search engine. It is a browser-based utility, and so you can modify the content you serve to it as you wish.

Blocking Google Web Preview By htaccess

I have tried to block Web Preview by htaccess and can confirm that the following works:

    # ban spam bots
    RewriteEngine on
    # matches any user agent string containing "Preview"
    RewriteCond %{HTTP_USER_AGENT} ^(.*)Preview(.*)$
    # send every matching request to Google's home page with a 301
    RewriteRule ^(.*)$ http://www.google.co.uk [R=301,L]

You may know of more graceful methods. That is my suggested method. Because the rewrite rule points to Google's home page, that page will end up as your preview snapshot, so you might want to change the target to a more inviting page for people to see in the search results.

Update - After further testing of the htaccess block for Google Web Preview, I have seen that Google does not always follow the redirect. Either it works fine and the preview shown is Google's home page (as directed above), or the result is the 'Preview not available' message. (Should it be that Web Preview does not follow all redirects, there will be other ways to present different content depending on the user agent which do not involve a redirect.)

Note: This will only work for the previews generated by the Google Web Preview bad bot and will have no effect on the ones generated by a normal crawl. In current testing, though, I see that if you look at the previews for pages which are newly indexed and not often crawled by Google, you may have more chance of generating a live preview request, which will then be cached. Whether this turns out true and accurate on a wider scale, time will tell.

Goodbye to Googlebot-Image blocking image indexing

Update Feb 2011. This page has been updated. Originally Google suggested that a directive in place for googlebot-image would affect previews. This was proven not to be the case. Do not worry if you have an instruction for googlebot-image in your robots.txt file. I have removed the previous commentary here to avoid confusing the issue.
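
Returning to the htaccess block earlier on this page: the update there mentions presenting different content by user agent without a redirect. The following is only a sketch of how that might look, not something tested against Google Web Preview. It assumes mod_rewrite is enabled, and the file name /preview-placeholder.html is a hypothetical page you would create yourself.

```apache
RewriteEngine On

# Match only user agents containing "Google Web Preview" (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} "Google Web Preview" [NC]
# Skip the placeholder itself so the rule cannot loop
# (the file name is an assumption for this example)
RewriteCond %{REQUEST_URI} !^/preview-placeholder\.html$
# No R flag: this is an internal rewrite, so no redirect is sent
# and the placeholder is served in place of the requested URL
RewriteRule ^ /preview-placeholder.html [L]
```

Because nothing is redirected, this avoids depending on whether Web Preview follows a 301. As with the redirect version, it could only affect previews fetched live by the Web Preview user agent, not those generated during a normal crawl.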