“Crawl budget” is a term for which we have heard a number of definitions, without a single clear one. Even Google itself doesn’t use one fixed term for it; however, in a post, Google’s crawling and indexing teams explain “what crawl budget is, how crawl rate limits work, what is crawl demand, what factors impact a site’s crawl budget, and what it means for Googlebot.”
The post emphasizes that for most publishers, crawl budget is not something they have to worry about. “If new pages tend to be crawled the same day they’re published, crawl budget isn’t something webmasters need to focus on,” says Gary. “If a site has fewer than a few thousand URLs, most of the time it will be crawled efficiently.”
Bigger sites, however, may want to look at “prioritizing what to crawl, when, and how much resource the site’s hosting server can allocate to crawling,” he said, especially sites that auto-generate pages based on URL parameters.
The following is a summary of the key points:
The “crawl rate limit” caps the maximum fetching rate for a given site while crawling, to make sure Googlebot doesn’t degrade the experience of users visiting the site. Simply put, when Googlebot crawls a site, it sets a maximum number of simultaneous parallel connections it can use for that site, as well as the time it must wait between fetches. The crawl rate limit is based on the following two factors:
- Crawl health: if a site responds quickly, Googlebot can use more connections; if the site slows down, Googlebot crawls less.
- Crawl limit set in Search Console: website owners can manually set a crawl rate limit for their sites in the Site Settings section.
“Crawl demand”: even if the crawl rate limit isn’t reached, Googlebot activity will be low when there is no demand from indexing, making the crawl rate limit moot. Crawl demand is influenced by two significant factors: “popularity and staleness.”
URLs that are more popular on the internet are crawled more often to keep them fresh in the search index and to prevent them from becoming stale. Additionally, “site-wide events like site moves” trigger an increase in crawl demand so the content under the new URLs can be reindexed.
Taking crawl rate and crawl demand together, Google defines crawl budget as “the number of URLs Googlebot can and wants to crawl.”
Lastly, for a site to maintain an optimal crawl budget, he recommends not wasting server resources on “low-value-add URLs,” as they drain crawling and indexing activity away from high-quality pages and can significantly delay the discovery of great content on a site.
As defined by Gary, low-value-add URLs fall into these categories:
- Faceted navigation and session identifiers
- On-site duplicate content
- Soft error pages
- Hacked pages
- Infinite spaces and proxies
- Low quality and spam content
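To illustrate the faceted-navigation case, a common mitigation is to block crawl-wasting URL parameters in robots.txt. The parameter names below (`sessionid`, `color`, `sort`) are hypothetical examples, not from Google’s post; substitute the parameters your own site generates:

```
User-agent: *
# Hypothetical session and faceted-navigation parameters that
# produce near-duplicate pages and waste crawl budget:
Disallow: /*?*sessionid=
Disallow: /*?*color=
Disallow: /*?*sort=
```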
Other notes about crawl budget:
- A speedy site lets Googlebot get more content over the same number of connections.
- A significant number of 5xx errors or connection timeouts slows crawling down, so monitor the Crawl Errors report in Search Console and keep the number of server errors low.
- While crawling is necessary for appearing in search results, it is not a ranking signal; Google uses hundreds of signals to rank results.
- The non-standard “crawl-delay” robots.txt directive is not processed by Googlebot.
- URLs marked as nofollow can still be crawled, and may affect your crawl budget, if other pages linking to that URL don’t label their links as nofollow, explained Gary.
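In other words, a robots.txt entry like the following has no effect on Googlebot (some other crawlers do honor it); the Search Console crawl rate setting is the supported way to throttle Googlebot:

```
User-agent: Googlebot
# Non-standard directive: Googlebot ignores this line.
Crawl-delay: 10
```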
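On the nofollow point, a nofollow link looks like this (the URL here is a hypothetical example):

```html
<!-- This page asks crawlers not to follow the link: -->
<a href="https://example.com/promo" rel="nofollow">Promo page</a>
<!-- But if any other page links to the same URL without rel="nofollow",
     the URL can still be crawled and count toward crawl budget. -->
```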
Google says it is no longer supporting the link: operator, which was once used to find pages linking to a specific domain in Google search. In a recent tweet, John Mueller of Google recommended “not to use the link operator in search.” However, Google has yet to reflect this change in its Search Console help document regarding links, which still references the operator:
“To find a sampling of links to any site, you can perform a Google search using the link: operator. For instance, [link:www.google.com] will list a selection of the web pages that have links pointing to the Google home page. Note there can be no space between the “link:” and the web page URL,” reads the document.
Back in February 2009, Matt Cutts, answering questions about the link: operator, admitted “the link operator was only designed to return a small sampling of backlinks to prevent SEOs from reverse engineering another site’s rankings.”
“How accurate is Google’s backlink-check (link:…)? Are all nofollow backlinks filtered out or why does Yahoo/MSN show quite more backlink results?”
“If you have inbound links from reputable sites but those sites do not show up in a link:webname.com search, does this mean you are not getting any ‘credit’ in Google’s eyes for having inbound links?”
“The short answer is that historically, we only had room for a very small percentage of backlinks because web search was the main part and we didn’t have a ton of servers for link colon queries and so, we have doubled or increased the amount of backlinks that we show over time for link colon, but it is still a sub-sample. It’s a relatively small percentage. And I think that that’s a pretty good balance, because if you just automatically show a ton of backlinks for any website then spammers or competitors can use that to try to reverse engineer someone’s rankings.”
Google has begun a test in which YouTube videos are embedded in Google Image Search on mobile for certain retail-related queries. Queries that triggered these results include “men jackets,” “lookbook,” and “winter outfit.”
These playable YouTube videos are labeled as “new look on YouTube.”
Alex said the videos contain “no sound and you can’t stop or hide the video, which continues to play on repeat.”