Google Begins Crawling & Indexing 'GET, POST' Requests

Google is now improving Googlebot's ability to crawl and index content that is dynamically loaded via AJAX or JavaScript, such as Facebook comments, Disqus comments, and the like.

In a blog post, Google's Pawel Aleksander Fedorynski, a software engineer on the Indexing Team, and Maile Ohye, Developer Programs Tech Lead, said, "With the growing popularity of JavaScript and AJAX, we're finding more web pages requiring POST requests -- either for the entire content of the page or because the pages are missing information and/or look completely broken without the resources returned from POST." They added, "We generally advise to use GET for fetching resources a page needs, and this is by far our preferred method of crawling. We've started experiments to rewrite POST requests to GET, and while this remains a valid strategy in some cases, often the contents returned by a web server for GET vs. POST are completely different."
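To see why such rewriting can fail, consider a server that returns different content depending on the request method. Here is a minimal, hypothetical Node.js sketch (the URL and response strings are invented for illustration and are not from Google's post):

    // Hypothetical sketch: the same URL answers GET and POST with
    // completely different content, so rewriting POST to GET would
    // fetch the wrong thing.
    var http = require('http');

    http.createServer(function (req, res) {
      if (req.url === '/hot-fudge-info.html' && req.method === 'POST') {
        // The real page fragment is served only in response to POST.
        res.end('Hot fudge is a warm chocolate sauce for sundaes.');
      } else {
        res.end('This resource must be requested via POST.');
      }
    }).listen(8080);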

Addressing webmasters who want control over what Google crawls, indexes, or shows in Instant Previews, the officials note: "If you'd like to prevent content from being crawled or indexed for Google Web Search, traditional robots.txt directives remain the best method. To prevent the Instant Preview for your page(s), please see our Instant Previews FAQ, which describes the 'Google Web Preview' User-Agent and the nosnippet meta tag."
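For instance (the path below is hypothetical), a standard robots.txt directive blocks crawling, while the nosnippet meta tag suppresses the Instant Preview:

    # robots.txt -- keep crawlers out of a private section
    User-agent: *
    Disallow: /private/

    <!-- in the page's <head> -- suppress the snippet/Instant Preview -->
    <meta name="googlebot" content="nosnippet">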

The officials suggest:

  • Prefer GET for fetching resources, unless there's a specific reason to use POST.
  • Verify that Googlebot is allowed to crawl the resources needed to render your page. "More subtly, if the JavaScript code that issues the XMLHttpRequest is located in an external .js file disallowed by robots.txt, we won't see the connection between yummy-sundae.html and hot-fudge-info.html, so even if the latter isn't disallowed itself, that may not help us much. We've seen even more complicated chains of dependencies in the wild. To help Google better understand your site it's almost always better to allow Googlebot to crawl all resources," the officials explain. (A robots.txt sketch of this situation appears after this list.)

    You can test whether resources are blocked through Webmaster Tools "Labs -> Instant Previews."
  • "Make sure to return the same content to Googlebot as is returned to users' web browsers. Cloaking (sending different content to Googlebot than to users) is a violation of Webmaster Guidelines. We've seen numerous POST-request examples where a webmaster non-maliciously cloaked (which's still a violation), and their cloaking -- on even the smallest of changes -- then caused JavaScript errors that prevented accurate indexing and completely defeated their reason for cloaking in the first place. Summarizing, if you want your site to be search-friendly, cloaking is an all-around sticky situation that's best to avoid," notes Google execs.

    To verify that you're not accidentally cloaking, you can use Instant Previews within Webmaster Tools, or try setting the User-Agent string in your browser to something like:

      Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

    Your site shouldn't look any different after such a change. If you see a blank page, a JavaScript error, or if parts of the page are missing or different, that means that something's wrong. (A curl version of this check appears after this list.)
  • Remember to include important content (i.e., the content you'd like indexed) as text, visible directly on the page and without requiring user-action to display. Most search engines are text-based and generally work best with text-based content.
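To make the robots.txt point concrete, here is a sketch of the dependency problem the officials describe (the /js/ path is hypothetical). If the external .js file that issues the XMLHttpRequest is disallowed, Googlebot never discovers hot-fudge-info.html, even though that URL itself is not blocked:

    # Problematic: blocks the script that requests hot-fudge-info.html,
    # hiding the connection between yummy-sundae.html and that resource
    User-agent: Googlebot
    Disallow: /js/

    # Fix: drop the rule, or explicitly allow the scripts directory:
    # Allow: /js/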
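The cloaking check can also be run from the command line with curl (assuming www.example.com stands in for your site): fetch a page with Googlebot's User-Agent string, fetch it again with curl's default, and diff the results:

    curl -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
         http://www.example.com/yummy-sundae.html > as-googlebot.html
    curl http://www.example.com/yummy-sundae.html > as-default.html
    diff as-googlebot.html as-default.html   # ideally prints nothing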

Google gave the following examples of how it is improving the crawling and indexing of POST requests:

Examples of Googlebot's POST requests

  • Crawling a page via a POST redirect

      <html>
        <body onload="document.foo.submit();">
          <!-- The form is submitted automatically on load, so the
               resulting content is reachable only via POST -->
          <form name="foo" action="request.php" method="post">
            <input type="hidden" name="bar" value="234"/>
          </form>
        </body>
      </html>
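    For reference, the automatic form submission above causes Googlebot to issue roughly the following HTTP request (the host name is hypothetical):

      POST /request.php HTTP/1.1
      Host: www.example.com
      Content-Type: application/x-www-form-urlencoded
      Content-Length: 7

      bar=234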
  • Crawling a resource via a POST XMLHttpRequest
    In this step-by-step example, we improve both the indexing of a page and its Instant Preview by following the automatic XMLHttpRequest generated as the page renders.

    1. Google crawls the URL, yummy-sundae.html.
    2. Google begins indexing yummy-sundae.html and, as a part of this process, decides to attempt to render the page to better understand its content and/or generate the Instant Preview.
    3. During the render, yummy-sundae.html automatically sends an XMLHttpRequest for a resource, hot-fudge-info.html, using the POST method.
        <html>
          <head>
            <title>Yummy Sundae</title>
            <script src="jquery.js"></script>
          </head>
          <body>
            This page is about a yummy sundae.
            <div id="content"></div>
            <script type="text/javascript">
              // Once the DOM is ready, POST to hot-fudge-info.html and
              // insert the response into the empty #content div
              $(document).ready(function() {
                $.post('hot-fudge-info.html', function(data) {
                  $('#content').html(data);
                });
              });
            </script>
          </body>
        </html>
    4. The URL requested through POST, hot-fudge-info.html, along with its data payload, is added to Googlebot's crawl queue.
    5. Googlebot performs a POST request to crawl hot-fudge-info.html.
    6. Google now has an accurate representation of yummy-sundae.html for Instant Previews. In certain cases, we may also incorporate the contents of hot-fudge-info.html into yummy-sundae.html.
    7. Google completes the indexing of yummy-sundae.html.
    8. User searches for [hot fudge sundae].
    9. Google's algorithms can now better determine how yummy-sundae.html is relevant for this query, and we can properly display a snapshot of the page for Instant Previews.
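Tying this back to the first recommendation above: nothing in this example actually requires POST, so the page could fetch the same fragment with GET, Googlebot's preferred method. A minimal variant of the script (same hypothetical file names):

    <script type="text/javascript">
      $(document).ready(function() {
        // Same effect as $.post, but via GET -- by far Google's
        // preferred method of crawling
        $.get('hot-fudge-info.html', function(data) {
          $('#content').html(data);
        });
      });
    </script>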