Bing Explains 'Reducing Junk' in the Search Results Pages


In the first week of March, Harry Shum, Corporate Vice President of Bing R&D, started a new series called "Search Quality Insights," designed to help people better understand how search works.

In the second article in this series, Dr. Richard Qian, Partner Development Manager, Core Search, Bing R&D, discusses reducing junk in the Bing search results.

"In search there's perhaps nothing worse than clicking a search result only to get an error message in return. Or being taken to a page that tells you the page you were trying to open can no longer be found. Equally annoying is when you end up on a page for a domain that was just registered and is plastered with ads without any useful content. These are different types of junk links that we refer to as dead links, soft 404s, and parked domains," Qian said.

Here is how Bing detects and removes these sites from the search results:

A dead link occurs when a 4xx or 5xx error code is returned from an HTTP request for a page. "Until we crawl these pages again and discover they're missing, we may still serve them in our search results. In general we want to remove all dead links from our search results as quickly as possible. However we observed many such issues were only transient and some pages came back alive after a short while. The classifiers that solve this problem help us decide whether a page is just temporarily experiencing an issue or if it has truly been deleted. If we think there is some suspicion about the page in question we may boost its re-crawl priority and frequency to help us make a determination as quickly as possible. There is an important tradeoff we make here between aggressively removing dead links and the relevance of our search results. We aim to minimize the number of dead links in our search results without removing content that may experience temporary issues but may be useful to our users," Qian explained.
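The transient-versus-permanent decision Qian describes can be sketched roughly as follows. This is a minimal illustration, not Bing's actual system: the status-code groupings, the `removal_threshold`, and all names are assumptions made for the example.

```python
# Hypothetical sketch of a dead-link decision: distinguish transient
# server hiccups from pages that are truly gone, boosting re-crawl
# priority for suspicious URLs. Thresholds are illustrative only.
from dataclasses import dataclass

TRANSIENT_CODES = {429, 500, 502, 503, 504}   # often temporary outages
PERMANENT_CODES = {404, 410}                   # usually gone for good

@dataclass
class CrawlRecord:
    url: str
    failures: int = 0            # consecutive error responses seen
    recrawl_boosted: bool = False

def update_on_crawl(record: CrawlRecord, status: int,
                    removal_threshold: int = 3) -> str:
    """Return 'keep', 'recrawl-soon', or 'remove' for a crawled URL."""
    if 200 <= status < 300:
        record.failures = 0          # page is alive again
        record.recrawl_boosted = False
        return "keep"
    record.failures += 1
    if status in PERMANENT_CODES and record.failures >= removal_threshold:
        return "remove"              # repeatedly confirmed dead
    # Suspicious page: raise re-crawl priority to decide quickly.
    record.recrawl_boosted = True
    return "recrawl-soon"
```

The key tradeoff from the quote shows up in `removal_threshold`: a higher value tolerates more transient failures before removal, at the cost of serving dead links longer.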

A soft 404 is like a hard 404 in that the page was deleted from the site hosting it, but in this case the server still returns a normal HTTP 200 code along with a webpage reporting that the original page you were trying to reach no longer exists. "Our high precision classifiers in this area use page content such as key phrases in the page's title, body and URL to determine if the page is a soft 404 and whether to remove it from the search results. E.g., for the query {Five Guys Burgers and Fries history}," explains Qian.

Here are Bing's top results before and after Bing applied its soft 404 classifiers:

[Image: Bing soft 404 classifiers]
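A soft-404 detector of the kind Qian describes — key phrases in the title, body, and URL — might look roughly like this. The phrase list and precision rule are invented for the sketch; Bing's real classifiers are trained models, not a fixed list.

```python
# Illustrative soft-404 detector: a page that returns HTTP 200 but
# reads like an error page gets flagged. Phrases are examples only.
SOFT_404_PHRASES = [
    "page not found",
    "no longer exists",
    "no longer available",
    "has been removed",
    "cannot be found",
]

def looks_like_soft_404(title: str, body: str, url: str) -> bool:
    """Flag pages that return HTTP 200 but read like an error page."""
    title_l, body_l, url_l = title.lower(), body.lower(), url.lower()
    haystacks = (title_l, body_l, url_l)
    hits = sum(any(p in h for h in haystacks) for p in SOFT_404_PHRASES)
    # For precision, require either an error phrase in the title or
    # two independent phrase matches anywhere on the page.
    title_hit = any(p in title_l for p in SOFT_404_PHRASES)
    return title_hit or hits >= 2
```

Requiring multiple signals mirrors the "high precision" goal in the quote: a history page that merely mentions "removed" once should not be dropped from the index.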

Parked domains refer to websites that contain placeholder content after a new domain registration. Usually these pages show ads to try to monetize traffic to the domain before it has been properly set up by the new site owner. "Like the techniques used to identify soft 404s, we look at the patterns in page content to determine if a page is a parked domain. By collaboratively mining many different types of pages against our large index of web data we are able to create signatures that allow us to identify parked domains when we see them and to remove them from our search results," Qian explains.
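The "signatures" idea can be illustrated with a toy matcher. In practice the signatures are mined from a large index, as the quote says; the list below is a hand-written stand-in, and the match threshold is an assumption.

```python
# Hypothetical parked-domain detector: match page text against
# signatures typical of registrar placeholder pages. The signature
# list and threshold are illustrative, not mined from real data.
PARKED_SIGNATURES = [
    "this domain is for sale",
    "buy this domain",
    "domain has been registered",
    "related searches",       # common on ad-feed placeholder pages
    "sponsored listings",
]

def is_parked_domain(page_text: str, min_matches: int = 2) -> bool:
    """A page matching several parked-page signatures is likely parked."""
    text = page_text.lower()
    matches = sum(sig in text for sig in PARKED_SIGNATURES)
    return matches >= min_matches
```

Requiring at least two signature matches keeps a legitimate page that happens to contain one of these phrases from being misclassified.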

Junky Snippets

Bing uses the standard UTF-8 encoding in its internal document processing stack and recommends that site owners use UTF-8 to minimize potential conversion issues that result in junky snippets. Bing does support other encoding formats, employing a learned classifier to detect the encoding of an input page and then converting it to UTF-8. It has also trained a classifier to catch unreadable or garbage text. In addition, Bing identifies HTML, XML, JavaScript and other markup using a comprehensive parser, and it continues to improve the parser's robustness in handling various types of pages and corner cases.
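The normalization pipeline above — detect encoding, convert to UTF-8, flag garbage — can be sketched with the standard library alone. Bing uses trained classifiers for both steps; this trial-decode heuristic and bad-character ratio are simplified stand-ins, and the candidate encoding list is an assumption.

```python
# Sketch of encoding normalization plus a garbage-text check.
# Real systems use learned classifiers; this is a stdlib stand-in.
CANDIDATE_ENCODINGS = ["utf-8", "shift_jis", "gb18030", "latin-1"]

def to_utf8(raw: bytes) -> str:
    """Decode bytes with the first candidate encoding that succeeds.

    latin-1 never raises, so it acts as a catch-all at the end.
    """
    for enc in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")  # last resort

def looks_like_garbage(text: str, max_bad_ratio: float = 0.1) -> bool:
    """Flag text dominated by replacement chars or control characters."""
    if not text:
        return True
    bad = sum(1 for c in text
              if c == "\ufffd" or (ord(c) < 32 and c not in "\n\r\t"))
    return bad / len(text) > max_bad_ratio
```

A snippet that fails the garbage check would be suppressed or regenerated rather than shown to the user.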

"The improved coverage and precision of our encoding classifier, document convertor, garbage detector, and HTML parser have reduced the occurrence of junky snippets in Bing's search results," explained Qian. Here is an example showing the improvement on a junky snippet for the query {Yimin Xiao}:

[Image: Bing junky snippets]

Empty Snippets

Many websites today make heavy use of client-side technologies like AJAX and Flash to provide rich and dynamic user experiences. Bing developed an in-house document converter to translate these documents into HTML, which can then be used for ranking and snippet generation. Qian notes, "Bing embraces the richness of the web with our dynamic crawlers and document processors that render and index dynamically generated pages. We also utilize a number of classifiers to determine whether a page is a plain static page or needs to be dynamically rendered."
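A crude version of that static-versus-dynamic decision is: if a page's raw HTML carries far more script than visible text, it probably renders its content client-side and needs dynamic crawling. The heuristic and thresholds below are invented for illustration; Bing's classifiers are far more sophisticated.

```python
# Rough heuristic for the static-vs-dynamic decision: pages whose
# visible text is tiny relative to their script payload likely need
# dynamic rendering. Thresholds are illustrative only.
import re

def needs_dynamic_rendering(html: str, min_text_chars: int = 200) -> bool:
    """True if the raw HTML likely renders most content client-side."""
    scripts = re.findall(r"<script\b.*?</script>", html, re.S | re.I)
    stripped = re.sub(r"<script\b.*?</script>", "", html,
                      flags=re.S | re.I)
    visible = re.sub(r"<[^>]+>", "", stripped).strip()
    heavy_scripts = sum(len(s) for s in scripts) > len(visible)
    return heavy_scripts and len(visible) < min_text_chars
```

A page flagged this way would be routed to a rendering crawler that executes the scripts before indexing; a plain static page skips that expensive step.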

[Image: Bing empty snippets]