Googlebot Crawling Through HTML Forms to Find More Pages

The terms Deep Web, Hidden Web, and Invisible Web have been used collectively to refer to content that has so far been invisible to search engine users. Traditionally, search engines mostly follow links in HTML. Google has now announced that they've been experimenting with HTML forms to discover new web pages and URLs that otherwise couldn't be found and indexed. Google is currently exploring HTML forms for some “high quality sites”.

[…]when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.
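To illustrate (this form and its field names are hypothetical, not taken from Google's post), a site search form with the input types mentioned above might look like this:

    <form action="/search" method="get">
      <!-- text box: Googlebot fills in words picked from the site itself -->
      <input type="text" name="q">
      <!-- select menu: values are chosen from the options already in the HTML -->
      <select name="category">
        <option value="books">Books</option>
        <option value="music">Music</option>
      </select>
      <!-- checkboxes and radio buttons: values likewise come from the HTML -->
      <input type="checkbox" name="instock" value="yes"> In stock only
      <input type="radio" name="sort" value="date"> Date
      <input type="radio" name="sort" value="relevance"> Relevance
      <input type="submit" value="Search">
    </form>

Choosing, say, q=guitar, category=music, instock=yes, and sort=date would let Googlebot generate and crawl a URL like example.com/search?q=guitar&category=music&instock=yes&sort=date, just as if a user had submitted that query.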

[…]our crawl agent, the ever-friendly Googlebot, always adheres to robots.txt, nofollow, and noindex directives. That means that if a search form is forbidden in robots.txt, we won't crawl any of the URLs that a form would generate. Google notes that they only do this form submission for “GET” forms. A form using GET results in a parametrized URL like example.com/show?foo=bar. The guidelines for webmasters say that a GET request should never actually change data on the server, like triggering a user registration; for such things, webmasters should use POST, which Googlebot will not submit. Google also notes that they “omit any forms that have a password input or that use terms commonly associated with personal information such as logins, userids, contacts, etc.”
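Because Googlebot honors robots.txt here, a webmaster who doesn't want form-generated URLs crawled can simply block the form's target path. A minimal sketch, assuming the form submits to /show as in the URL example above:

    # Keep Googlebot from crawling any URLs the form would generate,
    # e.g. example.com/show?foo=bar
    User-agent: Googlebot
    Disallow: /show

With this in place, none of the parametrized /show?… URLs the form produces will be fetched.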

Plus, Google says that web pages discovered through this enhanced crawl do not come at the expense of regular web pages that are already part of the crawl, so the change doesn't reduce PageRank for your other pages. As such, it should only increase the exposure of your site in Google. Google adds that the change does not affect the crawling, ranking, or selection of other web pages in any significant way.

Google, Search Engine, Googlebot, Crawling, Indexing, HTML, Forms