Languages of the World (Wide Web) In 2008 And Now

The web is vast and infinite. Its pages link together in a complex network, containing remarkable structures and patterns. Some of the clearest patterns relate to language.

"Most web pages link to other pages on the same web site, and the few off-site links they have are almost always to other pages in the same language. It's as if each language has its own web which is loosely linked to the webs of other languages. However, there are a small but significant number of off-site links between languages. These give tantalizing hints of the world beyond the virtual.

The languages of the web have become more densely connected than in 2008. There's now significant content in even more languages, and these languages are more closely linked.

To see the connections between languages, start by taking the several billion most important pages on the web in 2008, including all pages in smaller languages, and look at the off-site links between these pages.

Looking at the language web in 2008, we see a surprisingly clear map of Europe and Asia. The language linkages invite explanations around geopolitics, linguistics, and historical associations. The outlines of the Iberian and Scandinavian Peninsulas are clearly visible, which suggests geographic rather than purely linguistic associations," Daniel Ford and Josh Batson stated on the Google Research blog.

"What about the sizes of each language web? Both the number of sites in each language and the number of urls seen by Google's crawler follow an exponential distribution, although the ordering for each is slightly different. The exact number of pages in each language in 2008 is unknown, since multiple urls may point to the same page and some pages may not have been seen at all. However, the language of an un-crawled url can be guessed by the dominant language of its site. In fact, calendar pages and other infinite spaces mean that there really are an unlimited number of pages on the web, though some are more useful than others," said Ford and Batson.

"The largest language on the web, in terms of size and centrality, has always been English, but where is it on our map? Every language on the web has strong links to English, usually with around twenty percent of offsite links and occasionally over 45%, such as from Tagalog/Filipino, spoken in the Philippines, and Urdu, principally spoken in Pakistan. Both the Philippines and Pakistan are former British colonies where English is one of the two official languages."

"You might wonder whether off-site links landing on English pages can be explained simply by the number of English pages available to be linked to. The webs of other languages in our corpus typically have 60 to 80% of their out-language links to English pages. However, only 38% of the pages and 42% of sites in our set are English, while it attracts 79% of all out-language links from other languages. Taking this into account shows that Chinese and Japanese webs aren't unusually introverted given their size. In general, language webs with more sites are more introverted, perhaps due to better availability of content," Ford and Batson added.
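The comparison in that passage comes down to a ratio of shares. Here is a minimal sketch of the arithmetic, using only the percentages quoted above. The function name `overrepresentation` is made up for illustration and isn't taken from the Google Research post.

```python
def overrepresentation(link_share: float, size_share: float) -> float:
    """Ratio of a language's share of incoming out-language links to its share
    of the corpus. A value above 1 means it attracts more links than its size
    alone would predict. (Hypothetical helper, for illustration only.)"""
    return link_share / size_share


# Figures quoted in the article for English.
english_link_share = 0.79  # share of all out-language links from other languages
english_page_share = 0.38  # share of pages in the set
english_site_share = 0.42  # share of sites in the set

by_pages = overrepresentation(english_link_share, english_page_share)
by_sites = overrepresentation(english_link_share, english_site_share)
print(f"vs. pages: {by_pages:.2f}x, vs. sites: {by_sites:.2f}x")
```

Measured either way, English draws roughly twice the out-language links its size would suggest.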

"There's a roughly linear relationship between the (log) number of sites in a language and the fraction of off-site links which point to pages in the same language, with a correlation of 0.9 if English is removed. However, only 45% of off-site links from English pages are to other English pages, making English the most extroverted web language given its size. Other notable outliers are the Hindi web, which is unusually introverted, and the Tagalog and Malay webs which are unusually extroverted."
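To make the "roughly linear" claim concrete, here is a rough sketch of how such a correlation could be computed: take the base-10 log of each language's site count and correlate it with the fraction of off-site links that stay in the same language. The function names and the toy numbers are invented for illustration. They aren't Google's data, and the post doesn't say which correlation measure was used, so Pearson's is assumed here.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def introversion_correlation(
    site_counts: Sequence[int], same_language_fraction: Sequence[float]
) -> float:
    """Correlate log10(number of sites) with the share of off-site links that
    point to pages in the same language (assumed measure: Pearson)."""
    log_sites = [math.log10(count) for count in site_counts]
    return pearson(log_sites, same_language_fraction)


# Made-up toy values, chosen to be exactly linear in log10(sites).
toy_site_counts = [1_000, 10_000, 100_000, 1_000_000]
toy_same_language_fraction = [0.2, 0.4, 0.6, 0.8]

r = introversion_correlation(toy_site_counts, toy_same_language_fraction)
print(f"correlation: {r:.3f}")
```

With real data, a language that sits far from the fitted line would be an outlier of the kind described for English, Hindi, Tagalog and Malay.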

"We can generate another map by connecting languages if the number of links from one to the other is 50 times greater than expected given the number of out-of-language links and the size of the language linked to. This time, the native languages of India show up clearly. Surprising links include those from Hindi to Ukrainian, Kurdish to Swedish, Swahili to Tagalog and Bengali, and Esperanto to Polish," explained Ford and Batson.
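The 50-times rule can be sketched as follows. The post doesn't spell out how the "expected" number of links is computed, so this sketch assumes one plausible reading: a source language's out-of-language links are spread over the other languages in proportion to their number of sites. The function name, the language labels and all counts below are hypothetical.

```python
from typing import Dict, List, Tuple


def overlinked_pairs(
    links: Dict[Tuple[str, str], int],
    sites: Dict[str, int],
    threshold: float = 50.0,
) -> List[Tuple[str, str, float]]:
    """Return (source, target, ratio) for every language pair whose observed
    link count is at least `threshold` times the expected count.

    Assumed baseline: expected(src -> dst) = out-of-language links from src
    times dst's share of the sites in all languages other than src.
    """
    # Total out-of-language links leaving each source language.
    out_total: Dict[str, int] = {}
    for (src, dst), count in links.items():
        if src != dst:
            out_total[src] = out_total.get(src, 0) + count

    result: List[Tuple[str, str, float]] = []
    for (src, dst), observed in links.items():
        if src == dst:
            continue  # same-language links are not part of this map
        other_sites = sum(n for lang, n in sites.items() if lang != src)
        expected = out_total[src] * sites[dst] / other_sites
        ratio = observed / expected
        if ratio >= threshold:
            result.append((src, dst, ratio))
    # Strongest over-linking first.
    return sorted(result, key=lambda item: -item[2])


# Made-up toy corpus: one large, one mid-sized and one small language web.
toy_sites = {"big": 100_000, "mid": 1_000, "small": 100}
toy_links = {
    ("mid", "big"): 900,
    ("mid", "small"): 100,
    ("small", "big"): 95,
    ("small", "mid"): 5,
    ("big", "mid"): 50,
    ("big", "small"): 50,
}

for src, dst, ratio in overlinked_pairs(toy_links, toy_sites):
    print(f"{src} -> {dst}: {ratio:.1f}x expected")
```

Dividing by the target's size is what lets small language webs show up on this map: a link to a small web counts for far more than a link to English.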

[Source: Google Research blog]