Monday, August 1, 2011

Most likely reasons for a site being de-indexed by Google

If your site has suddenly been de-indexed by Google, this checklist will guide you through the most likely reasons.

Site de-indexed - Quick Text View
  • Have there been any removal requests via Webmaster Tools?
  • Has your site been hacked or compromised recently or currently?
  • Do you have a message from Webmaster Tools mentioning hacked or spammy content?
  • Do you show different content to googlebot and users with the intent that users should not see the text?
  • Have you very recently purchased the domain, and is it pre-owned?
  • Do you have a DMCA notice and/or is there one in the search results when looking for your site?
  • Do you have any other sites that are in any way similar?
  • Do you duplicate, copy, or rewrite content that can be found elsewhere?
  • Is your site themed around an affiliate scheme?
  • Does your site contain a very large amount of non-exclusive content?
  • Have you recently amended or removed problem content? 
Update, April 2012: this information is still current. If you have been given a reason not on this list for your site's de-indexing, treat it with caution and look for authoritative supporting evidence.

Saturday, July 30, 2011

De-indexed by Google - unlikely reasons for the ultimate penalty.

If your site has been removed from Google's index then, alongside the companion post on this site on the likely reasons for de-indexing, consider too some unlikely reasons for your delisting.

  • Selling links or engaging in link schemes.
  • Linking to bad neighbourhoods.
  • Hosting multiple sites on the same server.
  • Duplicate content caused by standard site structure.
  • Copyright issues without a DMCA notice in search results.
  • Lack of a nofollow tag on popular affiliate links
  • Canonical issues
  • Issues concerning H1, H2 and H3 tags
  • HTML markup that does not 'validate'
  • Lack of alt text
  • Lack of a favicon
  • Capitalization issues in meta tags
  • Text colour issues whilst text is generally viewable
  • Lack of a privacy policy, terms and conditions and contact information.
  • Benign duplicated data in /head area.
  • Any issues regarding sitemaps
  • Accessibility and navigation issues
These unlikely reasons have been drawn from submissions found in webmaster forums everywhere.

    Thursday, June 2, 2011

    Posts regarding the mechanics of indexing.

    Googlebot noticed that your site uses an SSL certificate which may be considered invalid by web browsers
    Without knowing your site, it's hard to judge the exact situation. In general, we send this message when we think that you might have content hosted on https and have found the SSL certificate to be invalid. In practice, this doesn't affect your site's crawling, indexing, or ranking. It may, however, confuse users when they click on one of these results and see a certificate warning in their browser, which is why we flag it for webmasters. 

    If you really don't use https, then I'd double-check to make sure that none of your content is being indexed like that, and then you're welcome to ignore this message. If you do find content indexed as https, I'd recommend using the usual canonicalization methods to resolve that:
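For example, here is a minimal sketch (in Python, using a hypothetical example.com URL) of the rel=canonical approach: given a page that was indexed under https by mistake, emit the canonical link element pointing at the http version. A site-wide 301 redirect from https to http would accomplish the same thing.

```python
from urllib.parse import urlparse, urlunparse

def http_canonical_tag(https_url):
    """Build a rel=canonical link element pointing at the http version
    of a URL that was accidentally indexed under https."""
    parts = urlparse(https_url)
    http_url = urlunparse(parts._replace(scheme="http"))
    return '<link rel="canonical" href="%s"/>' % http_url

print(http_canonical_tag("https://example.com/about"))
# -> <link rel="canonical" href="http://example.com/about"/>
```

The emitted element would go in the head of the https version of the page, telling crawlers which URL to prefer.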

    Crawling www and non-www
     For example, many sites let both the www and non-www versions of their site get indexed, without using something to control canonicalization (such as a redirect or the rel=canonical). Our algorithms will try to concentrate on one of those versions for search, meaning that we'd tend to crawl & index that version much more frequently than the other one. In a case like that, it could happen that the less-favored version was last seen with an older version of the CMS, just because we haven't been crawling it as frequently.

    Having a site indexed with www and non-www URLs at the same time is not a problem and generally wouldn't result in ranking fluctuations in web search. It helps us a little when we can focus on a single host name (www or non-www), since we don't have to worry about crawling both versions, but for the most part our algorithms also get along fine when there's a mixture of both.
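As an illustration of the site-side cleanup (not of Google's own algorithm), here is a sketch of mapping www and non-www variants of a URL onto one preferred host name, which is the rule a 301 redirect would implement. The example.com URLs are hypothetical.

```python
from urllib.parse import urlparse, urlunparse

def canonical_host(url, prefer_www=False):
    """Rewrite a URL so its host uses the site's one preferred form.

    This sketches what a site-wide 301 redirect rule would do; it does
    not model how Google picks a version when the site specifies nothing.
    """
    parts = urlparse(url)
    host = parts.netloc
    bare = host[4:] if host.startswith("www.") else host
    host = "www." + bare if prefer_www else bare
    return urlunparse(parts._replace(netloc=host))

print(canonical_host("http://www.example.com/page"))
# -> http://example.com/page
```

Whichever form is chosen, the point is consistency: every internal link, redirect, and sitemap entry should use the same host.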

    Canonical tags: following, pagination, print and product pages.
    When we see a canonical link element like that and follow it (which we mostly do), we'll treat it similarly to a redirect. So if you play around with the rel=canonical, you have to be very careful because you won't see the "redirect" that Googlebot will use for indexing.

    - Pagination: this is complicated; I'd personally be careful about using rel=canonical with paginated lists. The important part is that we should be able to find all products listed, so at the very least those lists should provide a default sort order through which we can access (and index) all pages. Since this is somewhat difficult unless you really, really know what you are doing, I would personally avoid adding rel=canonical to these pages. One possible solution could be to use JavaScript for paginated lists with different sort orders; that way you would have a single URL which lists all products.

    - Printer-friendly pages: Personally, I'd suggest just using a normal printer style sheet, which would let you keep the same URLs. Short of that, using a rel=canonical is also fine.

    - Product pages: if you have separate URLs for the same product (e.g. books > non-fiction > guide-to-Italy and books > travel > guide-to-Italy), then picking one and pointing the canonical from the other pages to it is fine. Setting a category page as the canonical seems like a bad idea, since we then won't be able to index the product pages.

    Crawling canonical links
    The data shown there is based on our crawling activity, which is why you'd see those URLs there if you're using rel=canonical. We have to crawl and index these URLs first, before the rel=canonical is extracted, so it may even happen that they are temporarily visible in the search results. That's fine - and not something you'd need to prevent. As we process the content there, we'll focus on your preferred canonical for further indexing.

    The sky is not falling : www and non-www
    In general, just having a site accessible via www and non-www is not so much of a problem.

    We're generally pretty good at figuring that out
    While cleaning up issues like canonicalization with 301 redirects is good, it isn't the most important thing on a website. If it gets way too complicated to fix with your current setup, I'd just leave them as is, perhaps using Webmaster Tools to select your preference if you can. We're generally pretty good at figuring that out, no need to worry too much about it :-).

    Google auto-canonicalise ?
    Yes, we can and do sort this kind of issue out algorithmically all the time :-). Most sites don't specify a canonical in Webmaster Tools, yet we index them just the same. That said, if we notice that both versions show the same page, we'll just pick one of those and show it to users in the search results. By doing that, there's a chance we might not pick the one which YOU prefer -- so with this setting, and with a 301 redirect, you have a way of telling us your preference.
    There are a few other advantages of specifying one or the other. For example, in order for us to notice that the content on both URLs is the same, we have to actually crawl both versions. Depending on your website and on your server, this might not be a problem -- or it might be a big problem (if accessing those URLs uses a lot of your resources). By using a redirect or specifying a canonical version you can help reduce that overhead.
    At any rate, no you certainly don't have to do this; it's just something that you could do if you wanted to :).
    Regarding the original question, if we have chosen to index your site as "" then you won't find it by searching for "" (because we don't have the "www" part in the URLs). However, if you turn it around and tell us to index "" we'll have both versions available. Regardless of that, when a user searches for your URL they generally already know how to reach you, so this is usually not something worth getting grey hair over.

    Cleaning up the index ?
    From the search you mentioned, I searched for some of the product titles there. For the ones that I checked, your HTTPS pages did not show up in the search results, so I wouldn't really worry about it. Give it time and as we recrawl these pages, we'll update them in the index accordingly. At any rate, since the pages redirect to the preferred ones, you wouldn't have to specify the "noindex" x-robots-tag anyway and in addition, any users who happen to come through the HTTPS pages will make it to your site regardless. There's generally no need to clean up the indexed URLs this granularly :-).

    Session URLs in sitemap files
    If you are not submitting clean URLs in your Sitemap file, you'd be better off not using a Sitemap file. With session-IDs in there, it'll cause more problems (with us crawling and indexing those URLs) than if you just let us crawl your website normally (especially if you really have a clean URL structure). So my advice would be to either delete the Sitemap file, or make sure that the submitted URLs are really exactly the same, clean ones that we find while crawling.
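A sketch of the cleanup this suggests: strip session-ID query parameters before URLs go into the Sitemap file, so the submitted URLs match the clean ones found while crawling. The parameter names below are assumptions; adjust them to whatever your CMS actually emits.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Query parameters that look like session IDs (hypothetical names;
# change this set to match your own platform).
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}

def clean_url(url):
    """Drop session-ID parameters so only clean URLs reach the sitemap."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in SESSION_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))

print(clean_url("http://example.com/page?id=7&PHPSESSID=abc123"))
# -> http://example.com/page?id=7
```

Running every candidate URL through a filter like this before writing the Sitemap avoids submitting the same page under many session-specific addresses.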

    304 Not Modified
    As many servers are incorrectly configured, we do not always crawl using conditional requests, so what you are seeing -- as far as I understand it -- is normal. Additionally, as Cristina mentioned, the "Fetch as Googlebot" feature will always use unconditional requests, so you should see the "200 OK" there as well. Finally, the type of request made will generally not have an influence on your site's ranking (assuming your server returns the proper content for those requests).
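To make the conditional-request mechanics concrete, here is a toy server-side sketch: a request carrying If-Modified-Since gets "304 Not Modified" when the resource hasn't changed since that date, while an unconditional request (as with "Fetch as Googlebot") always gets "200 OK". The dates are hypothetical.

```python
from datetime import datetime, timezone

def respond(last_modified, if_modified_since=None):
    """Decide the status code for a (possibly conditional) GET.

    304 means: your cached copy is still current, headers only, no body.
    200 means: here is the full body (always the case for an
    unconditional request, i.e. one with no If-Modified-Since header).
    """
    if if_modified_since is not None and last_modified <= if_modified_since:
        return 304  # Not Modified
    return 200      # OK, full response

changed = datetime(2011, 6, 1, tzinfo=timezone.utc)   # resource last changed
cached = datetime(2011, 6, 2, tzinfo=timezone.utc)    # client's copy from later

print(respond(changed, cached))  # -> 304 (nothing changed since the cache date)
print(respond(changed))          # -> 200 (unconditional request)
```

Either way the correct content is available, which is why the request type itself doesn't affect ranking.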

    302 redirect away from root
    For what it's worth, a 302 redirect is the correct redirect from a root URL to a detail page (such as from "/" to "/sites/bursa/"). This is one of the few situations where a 302 redirect is preferred over a 301 redirect. However, as Colin mentioned, if you were hosting this yourself, you might want to look into saving an additional jump by just serving the content directly (it's not necessary, but if you can do it, it's always nice to save the user a redirect).

    Generally speaking, with a 302 redirect we'd try to take the content of the redirect target (in your case PAGE-B) and index it under the redirecting URL (in your case PAGE-A). If the target has a noindex meta tag, then it's likely that we'd apply that to the redirecting URL as well.
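The 302 behaviour described above can be sketched as a toy model (the page names are hypothetical, and real indexing is far more involved): the target's content ends up indexed under the redirecting URL, and a noindex on the target tends to keep the redirecting URL out as well.

```python
def index_with_302(pages, redirects, noindex=frozenset()):
    """Toy model of 302 indexing.

    pages: URL -> content, redirects: source URL -> target URL (302s),
    noindex: URLs carrying a noindex meta tag.
    """
    # Index each page directly, unless it carries noindex.
    index = {url: content for url, content in pages.items()
             if url not in noindex}
    # For a 302, index the target's content under the *redirecting* URL;
    # a noindex on the target applies to the redirecting URL too.
    for source, target in redirects.items():
        if target in pages and target not in noindex:
            index[source] = pages[target]
    return index

pages = {"/sites/bursa/": "Bursa detail page"}
idx = index_with_302(pages, {"/": "/sites/bursa/"})
print(idx["/"])  # -> Bursa detail page
```

In the post's example, the root URL "/" would thus show the content of "/sites/bursa/" in the index, without any permanent URL change being implied.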

    Change of Hosting
    It seems you changed hosting infrastructure around May 11th. When our algorithms see a hosting change, they try to lower Googlebot's crawl rate as a safety mechanism, so as not to overload the servers. In time, as we crawl more and learn more about the load the hosting seems capable of handling, the algorithms will automatically try to increase the crawl rate. You're seeing this process in the 30% growth in crawl rate you report, and there is a good chance it will continue to grow.

    Making a great site
    Looking through here, I think Cristina mentioned a really good point -- having great content, especially on your homepage, can do wonders for your site's visibility in search results. Not only will it provide something for our crawlers to pick up & to help us better understand your website, but it will also be something that can and will attract links from other websites.
    In my opinion, next to having a technically "ok" website, the content itself is one of the biggest "SEO elements" you can work on. That's not something you need an SEO company for; it's something which you -- as the expert in that business -- need to work on yourself. Make something that you would recommend to others in the same business!