Avoid .0, .EXE, .TAR, .TGZ and other binary file extensions in URLs for better Google Indexing

SEOMoz has a post reporting that ending your URLs with .0 will prevent your pages from being crawled for the Google index. They had a URL that ended with “/web2.0”. Previously they had a URL that looked like “/web2.0/” (note the trailing slash), which Google was happy to crawl, index, and rank. But when their linkage shifted enough that “/web2.0” became their preferred URL, Google wouldn’t crawl URLs ending in “.0”, so the page went uncrawled.

Matt Cutts addressed this issue on his blog:

[…]people will ask me “Does it matter what extension I use for my pages? Does Google prefer .php over .asp, or .html over .htm?” And my answer is “We’re happy to crawl all of these file extensions. It doesn’t matter what you choose between any of those.”

“But there are some file extensions that are mostly binary data, such as .exe, where the vast majority of the time the data would be meaningless blobs, so there are a few extensions to avoid. If your files are named example.dll or example.bin and you don’t see Google crawling pages with that file extension, I’d recommend changing your file extension to something else.”

There’s a simple way to check whether Google will crawl things with a certain filetype extension. If you do a query such as [filetype:exe] and you don’t see any urls that end directly in “.exe” then that means either:

1) there are no such files on the web, which we know isn’t true for .exe, or

2) Google chooses not to crawl such pages at this time — usually because pages with that file extension have been unusually useless in the past. So for example, if you query for [filetype:tgz] or [filetype:tar], you’ll see urls such as “papers.ssrn.com/pape.tar?abstract_id” that contain “.tar” but no files that end directly in .tar. That means that you probably shouldn’t make your html pages end in .tar.

Even though urls ending in “.0” are often binary and therefore end up getting dropped later in our indexing pipeline, it’s always good to revisit old decisions and respond to feedback by running new tests. So just in the last day or so, we switched it so that Google is willing to crawl pages that end in “.0”. This will help the small number of pages out on the web that want to serve up HTML pages with a “.0” extension.

He ends with the following takeaways:

  • Why Google doesn’t crawl some filetype extensions (when we’ve seen good evidence that the extensions are mostly binary or otherwise not-very-indexable files).
  • An easy way to use the filetype: operator, so that you can decide whether to avoid a particular filename extension yourself.
  • Google is willing to revisit old decisions and test them again, which is what we’re doing with the “.0” filetype extension.

The conclusion:

  1. Avoid URLs ending with binary file extensions
  2. Examples: .0, .EXE, .TAR, .TGZ
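
As a rough illustration of that advice, here is a minimal Python sketch that flags URLs ending directly in one of the extensions mentioned in this post. The list RISKY_EXTENSIONS and the function name ends_in_risky_extension are hypothetical, chosen for this example; they are not an official Google list or API. Note that ".0" is included only because it appears in the conclusion above, even though the quoted post says Google has started crawling such URLs.

```python
# Extensions named in this post. This is an assumption for illustration,
# not an official list of what Google will or will not crawl.
RISKY_EXTENSIONS = (".0", ".exe", ".tar", ".tgz", ".dll", ".bin")


def ends_in_risky_extension(url: str) -> bool:
    """Return True if the URL string ends directly in a listed extension.

    The whole URL is compared, so a trailing slash or a query string
    after the extension means the URL does not "end directly" in it.
    The comparison is case-insensitive.
    """
    return url.lower().endswith(RISKY_EXTENSIONS)


if __name__ == "__main__":
    # example.com is a placeholder; the ssrn URL is the one quoted above.
    for candidate in (
        "example.com/web2.0",
        "example.com/web2.0/",
        "example.com/setup.EXE",
        "papers.ssrn.com/pape.tar?abstract_id",
    ):
        print(candidate, ends_in_risky_extension(candidate))
```

The check looks at the full URL string on purpose: as the quoted example shows, a URL that merely contains “.tar” before a query string is treated differently from one that ends in it.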

Source:→ SEL