Xbox: Play: Microsoft Tellme Adds Voice Recognition to Kinect

"Imagine if everything you said could be interpreted by the Xbox 360 as a command," says Keith Herold, a senior program manager lead with Microsoft Tellme, the company's speech-recognition service that also powers Windows Phone 7 devices and appears in an array of other products.

To solve this, the Xbox team reached out to Ivan Tashev, a Microsoft Research principal software architect, who had been prototyping technologies for speech enhancement, audio processing, microphone arrays and echo cancellation. For the Xbox 360 system, he went to work purifying the audio signal so the Kinect could understand what it was being told. He used his expertise in echo cancellation to subdue everything coming out of the console -- soundtracks, movie dialogue, game audio -- as well as room noise the microphone would pick up. This was an immensely challenging problem based on advanced mathematics, but Tashev relished the task, according to Microsoft.
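To give a rough sense of what echo cancellation involves, here is a minimal sketch of a textbook adaptive filter (normalized LMS). It is not Tashev's method or the Kinect implementation, which the article does not describe; the function name, parameters and default values are all assumptions chosen for illustration. The idea is that the console knows what it is sending to the speakers, so it can learn how that sound arrives at the microphone and subtract it.

```python
def nlms_echo_cancel(mic, ref, taps=4, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `ref` from `mic`.

    mic: microphone samples (speech plus echo of the console audio)
    ref: the samples the console is playing, known to the system
    taps, mu, eps: illustrative filter length, step size and
                   regularizer (assumed values, not from the article)
    """
    weights = [0.0] * taps  # current estimate of the echo path
    cleaned = []
    for n in range(len(mic)):
        # Most recent `taps` reference samples, newest first.
        window = [ref[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        # Predicted echo at the microphone.
        echo_estimate = sum(w * x for w, x in zip(weights, window))
        # What remains after removing the predicted echo.
        residual = mic[n] - echo_estimate
        # Normalized LMS update: nudge weights to shrink the residual.
        energy = sum(x * x for x in window) + eps
        weights = [w + mu * residual * x / energy
                   for w, x in zip(weights, window)]
        cleaned.append(residual)
    return cleaned
```

In this toy setup the residual is what would be passed on to the speech recognizer. A real living room has long reverberation, several loudspeakers and several microphones, all of which this sketch ignores.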


The living room presents unique challenges for voice technology. Kinect's microphone array needs to work seamlessly up to four meters away from the couch and contend with ambient noise such as conversations, movie soundtracks and music.

Tashev explains:

In the end, the Kinect's audio enhancement chain consists of six major stages that consecutively improve the quality of the speech signal, removing clutter, noise and reverberation from the room to help the speech recognizer do its job.

With the audio pipeline in place, the next step was to integrate that signal with the Microsoft Tellme speech service. For this phase of the project, the Xbox team turned to Herold's team to bring Microsoft Tellme to the Xbox 360 platform.

"We never want a command to trigger random actions on the console," Herold says. "The idea of 'never' is not achievable of course, but we picked a suitably small number for never."

The solution to this problem was the software equivalent of a concept first developed for backpack-sized walkie-talkies in the 1940s -- the transmit, or "push-to-talk," button. This was embodied as the keyword "Xbox."

"When you say 'Xbox,' the system knows you're talking to it and what's coming next is a command. If you don't say it first, you haven't pushed the virtual 'push-to-talk' button, and the system won't listen," Herold says.
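The virtual "push-to-talk" idea Herold describes can be sketched in a few lines. This is only an illustration working on already-recognized text, not the Microsoft Tellme implementation; the function name `parse_utterance` and the command set are hypothetical, with "play" and "pause" taken from the article's own examples.

```python
import re

WAKE_WORD = "xbox"           # the keyword named in the article
COMMANDS = {"play", "pause"}  # hypothetical command set for illustration


def parse_utterance(text):
    """Return the command if the utterance opens with the wake word.

    Anything not prefixed by the wake word is ignored, which is the
    software analogue of not pressing the push-to-talk button.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words or words[0] != WAKE_WORD:
        return None  # button not "pressed": do not listen
    command = " ".join(words[1:])
    # Only accept commands the system actually knows.
    return command if command in COMMANDS else None
```

With this gate, "Xbox: play." yields the command "play", while a passing remark such as "Let's pause the movie" yields nothing, because the wake word never opened the channel.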

Since the Kinect supports both speech and gestures, the combined Xbox and Microsoft Tellme team spent considerable time determining how to enable both forms of interaction in a way that was complementary and intuitive. Their guiding principle was the concept of the Natural User Interface (NUI), in which people communicate with machines in the most human way possible.

For example, speech might be the best modality to search through thousands of songs, since gesturing to scroll through such a vast list could be tedious. Telling the machine, "Xbox: Bing, The Beatles" allows the user to get what they want in the most natural way possible from the vast collection of content available through Xbox LIVE.

Once the list is narrowed, using gesture to select a specific song may be the most natural interaction. Graphics, text and sounds on screen help cue users to make the interface more intuitive and easy to use.

According to Herold, this is the strength of "multimodal" interfaces, which combine speech with touch, gesture or other forms of input: Each modality is used where it is stronger, and the combination becomes much more powerful.

Advancing the Platform

For the first iteration of the device, the Xbox team prioritized the commands that would resonate most with people in their living rooms. They decided that simple navigation functions and media playback controls -- "Xbox: play. Xbox: pause." -- gave people something valuable, while also demonstrating the system's potential.

"For the launch of Kinect, we leapt over some major technology hurdles on our way to 'Xbox: play.' and 'Xbox: pause.'," Soemo says. "Nobody had ever done highly accurate speech recognition from up to four meters away, without a physical 'push-to-talk' button, in an environment filled with ambient noise, all while playing in 5.1 surround sound. Because of the collaboration among the Xbox, Microsoft Research and Microsoft Tellme teams, we were able to take science fiction and make it science fact."

Soemo says the functionality announced at E3 is just the second iteration in the journey toward the Xbox 360 system becoming the entertainment hub for the home -- redefining how people discover and use the range of media content available on Xbox LIVE and making the remote a thing of the past.

"We are laying a foundation that will transform how people interact with devices," Soemo says. "We are at that cusp. With Kinect, we've put speech into the living room. Now, Microsoft will continue to push the boundaries of NUIs to enable seamless experiences that span devices and platforms."

With that foundation in place, the Kinect's latest functionality goes well beyond simple navigation and allows people to use voice commands to traverse very large media catalogs with ease, and the team doesn't plan to stop there.

[Source: Microsoft Press]