Google's research team at its secretive "X Lab" facility has been working on new approaches to large-scale machine learning.
"For example, say we're trying to build a system that can distinguish between pictures of cars and motorcycles. In the standard machine learning approach, we first have to collect tens of thousands of pictures that have already been labeled as "car" or "motorcycle"--what we call labeled data--to train the system. But labeling takes a lot of work, and there's comparatively little labeled data out there. So we developed a distributed computing infrastructure for training large-scale neural networks. Then, we took an artificial neural network and spread the computation across 16,000 of our CPU cores (in our data centers), and trained models with more than 1 billion connections," explain Jeff Dean, Google Fellow, and Andrew Ng, Visiting Faculty.
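The "standard approach" Dean describes can be sketched with a toy classifier. This is a minimal illustration, not Google's system: the 2-D "feature vectors", the cluster positions, and the logistic-regression model are all stand-ins chosen to show that every training example must arrive with a label.

```python
import math
import random

random.seed(0)

# Hypothetical 2-D feature vectors standing in for image features:
# "car" examples cluster near (2, 2), "motorcycle" examples near (-2, -2).
cars = [(random.gauss(2, 0.5), random.gauss(2, 0.5)) for _ in range(50)]
bikes = [(random.gauss(-2, 0.5), random.gauss(-2, 0.5)) for _ in range(50)]
# Every example carries a label: 1 = car, 0 = motorcycle.
data = [(x, 1) for x in cars] + [(x, 0) for x in bikes]

# Logistic regression trained by batch gradient descent -- the step
# that consumes all those hand-labeled examples.
w = [0.0, 0.0]
b = 0.0
lr = 0.5
for _ in range(500):
    gw = [0.0, 0.0]
    gb = 0.0
    for (x1, x2), y in data:
        p = 1.0 / (1.0 + math.exp(-(w[0] * x1 + w[1] * x2 + b)))
        gw[0] += (p - y) * x1
        gw[1] += (p - y) * x2
        gb += p - y
    n = len(data)
    w[0] -= lr * gw[0] / n
    w[1] -= lr * gw[1] / n
    b -= lr * gb / n

def predict(x1, x2):
    """Classify a feature point as "car" or "motorcycle"."""
    return "car" if w[0] * x1 + w[1] * x2 + b > 0 else "motorcycle"
```

The point of the sketch is the bottleneck Dean names: the classifier only works because all 100 training points were labeled in advance.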
The Google X laboratory's "brain simulator" is equipped with 1,000 computers and 16,000 computational units (i.e., processor cores), with 1 billion connections overall. The system is used to run advanced machine learning algorithms developed by Google researchers to search and analyze huge amounts of data.
Fed 10 million images taken from as many YouTube video thumbnails, the algorithms learned to recognize a cat's face, arguably one of the most pervasive units of digital information clogging the Internet (before, during, and likely after YouTube's existence) over the last two decades.
"We ran experiments that asked, informally: If we think of our neural network as simulating a very small-scale "newborn brain," and show it YouTube video for a week, what will it learn? Our hypothesis was that it would learn to recognize common objects in those videos. Indeed, to our amusement, one of our artificial neurons learned to respond strongly to pictures of... cats," Dean wrote.
"Remember that this network had never been told what a cat was, nor was it given even a single image labeled as a cat. Instead, it "discovered" what a cat looked like by itself from only unlabeled YouTube stills. That's what we mean by self-taught learning."
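A single-neuron toy model gives the flavor of what Dean calls self-taught learning. The real system was a billion-connection deep network; the sketch below substitutes Oja's Hebbian learning rule for one "neuron", and the 4-element "frames" containing a recurring pattern are invented stand-ins for YouTube stills containing cats. The neuron is never told which frames contain the pattern.

```python
import random

random.seed(1)

# Unlabeled "frames": about half contain a recurring motif (a crude
# stand-in for the cat faces), the rest are pure noise. No frame
# carries a label of any kind.
pattern = [1.0, 1.0, 0.0, 0.0]
frames = []
for _ in range(2000):
    if random.random() < 0.5:
        frames.append([p + random.gauss(0, 0.1) for p in pattern])
    else:
        frames.append([random.gauss(0, 0.1) for _ in range(4)])

# One artificial neuron trained with Oja's rule: its weights gradually
# align with the dominant direction in the data, without it ever being
# told what that direction "means".
w = [random.gauss(0, 0.1) for _ in range(4)]
lr = 0.01
for x in frames:
    y = sum(wi * xi for wi, xi in zip(w, x))                    # neuron output
    w = [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]   # Oja update

def response(x):
    """How strongly the trained neuron fires on an input frame."""
    return sum(wi * xi for wi, xi in zip(w, x))
```

After training, the neuron responds strongly to the recurring motif and weakly to everything else, which is the miniature analogue of a "cat neuron" emerging from unlabeled video.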
"Using this large-scale neural network, we also significantly improved the state of the art on a standard image classification test--in fact, we saw a 70 percent relative improvement in accuracy," Dean said.
"We achieved that by taking advantage of the vast amounts of unlabeled data available on the web, and using it to augment a much more limited set of labeled data. This is something we're really focused on--how to develop machine learning systems that scale well, so that we can take advantage of vast sets of unlabeled training data."
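The idea of augmenting a small labeled set with abundant unlabeled data can be illustrated with a deliberately simple scheme (not the paper's method): let k-means discover structure in the unlabeled data, and use just one labeled example per class to name the discovered clusters. The class names and cluster positions below are made up.

```python
import random

random.seed(2)

# Plenty of unlabeled points drawn from two hypothetical classes...
unlabeled = ([(random.gauss(2, 0.4), random.gauss(2, 0.4)) for _ in range(200)]
             + [(random.gauss(-2, 0.4), random.gauss(-2, 0.4)) for _ in range(200)])
random.shuffle(unlabeled)

# ...but only two labeled examples in total.
labeled = [((2.1, 1.9), "car"), ((-2.0, -2.1), "motorcycle")]

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)),
               key=lambda i: (p[0] - centers[i][0]) ** 2
                           + (p[1] - centers[i][1]) ** 2)

# Step 1: k-means over the unlabeled data finds the cluster structure.
# Seeding the centers at the labeled examples keeps the toy example stable.
centers = [p for p, _ in labeled]
for _ in range(10):
    groups = [[] for _ in centers]
    for p in unlabeled:
        groups[nearest(p, centers)].append(p)
    centers = [(sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
               for g in groups]

# Step 2: the two labeled examples are enough to name the clusters.
names = {nearest(p, centers): label for p, label in labeled}

def classify(p):
    return names[nearest(p, centers)]
```

The unlabeled points do most of the work of locating the class boundaries; the scarce labels are needed only to attach names, which is the scaling property the quote is describing.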
"We're actively working on scaling our systems to train even larger models. As a very rough comparison, an adult human brain has around 100 trillion connections…. So we still have lots of room to grow," Dean concludes.
You can read the full paper here.