Google Research Releases a Database of 7,560,141 Concepts and 175,100,788 Unique Text Strings; 2011 Global Diversity & Talent Inclusion Report


Google Research has released, under a Creative Commons license, a resource spanning 7,560,141 concepts and 175,100,788 unique text strings. The data set was designed for recall.

"It is large and noisy, incorporating 297,073,139 distinct string-concept pairs, aggregated over 3,152,091,432 individual links, many of them referencing non-existent articles," posted Valentin Spitkovsky and Peter Norvig, Research Team.

"The data set contains triples, each consisting of (i) text, a short, raw natural language string; (ii) url, a related concept, represented by an English Wikipedia article's canonical location; and (iii) count, an integer indicating the number of times text has been observed connected with the concept's url. Our database thus includes weights that measure degrees of association," Norvig wrote.


"An inverted index can be used to perform reverse look-ups, identifying salient terms for each concept. Associated counts can easily be turned into percentages. The words-to-concepts dictionary direction can disambiguate senses and link entities, which are often highly ambiguous, since people, places and organizations can (nearly) all be named after each other," explains Norvig.

How do we represent concepts?

"Our approach piggybacks on the unique titles of entries from an encyclopedia, which are mostly proper and common noun phrases. We consider each individual Wikipedia article as representing a concept (an entity or an idea), identified by its URL. Text strings that refer to concepts were collected using the publicly available hypertext of anchors (the text you click on in a web link) that point to each Wikipedia page, thus drawing on the vast link structure of the web. For every English article we harvested the strings associated with its incoming hyperlinks from the rest of Wikipedia, the greater web, and also anchors of parallel, non-English Wikipedia pages. Our dictionaries are cross-lingual, and any concept deemed too fine can be broadened to a desired level of generality using Wikipedia's groupings of articles into hierarchical categories," Norvig explained.

For technical details, see this paper (to be presented at LREC 2012) and the README file accompanying the database.

Diversity at Google

Google also released its "2011 Global Diversity & Talent Inclusion Report," highlighting its diversity efforts in 2011, a year in which the company partnered with and donated $19 million to more than 150 organizations working on advancing diversity.

In addition to the report, Google created a web site that looks back at its 2011 diversity and inclusion highlights.

"In the U.S., fewer and fewer students are graduating with computer science degrees each year, and enrollment rates are even lower for women and underrepresented groups. It's important to grow a diverse talent pool and help develop the technologists of tomorrow who will be integral to the success of the technology industry," posted Yolanda Mangolini, Director, Global Diversity & Inclusion/Talent & Outreach Programs.