AI Case Study
Google's 'cocktail party effect' model identifies and focuses on an individual's voice in a video of people talking in a crowded room
Google Research created models that can separate speech in a crowded room. They built a deep neural network system based on multi-stream convolutional neural networks and bidirectional LSTMs, and trained it on roughly 2,000 hours of video clips, each of a single person visible to the camera and talking with no background interference. The system separates speech accurately, which could lead to applications in speech enhancement and recognition in videos, as well as improved hearing aids.
Industry
Technology
Internet Services Consumer
Project Overview
Google developed "a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to 'focus' the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset comprised of thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as real-world scenarios involving heated interviews, noisy bars, and screaming children, only requiring the user to specify the face of the person in the video whose speech they want to isolate.
"Our method works on ordinary videos with a single audio track, and all that is required from the user is to select the face of the person in the video they want to hear, or to have such a person be selected algorithmically based on context. We believe this capability can have a wide range of applications, from speech enhancement and recognition in videos, through video conferencing, to improved hearing aids, especially in situations where there are multiple people speaking.
Reported Results
Google claims "clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech. In addition, our model, which is speaker-independent (trained once, applicable to any speaker), produces better results than recent audio-visual speech separation methods that are speaker-dependent (require training a separate model for each speaker of interest)."
Google believes "this capability can have a wide range of applications, from speech enhancement and recognition in videos, through video conferencing, to improved hearing aids, especially in situations where there are multiple people speaking."
Technology
An Audio-Visual Speech Separation Model
"To generate training examples, we started by gathering a large collection of 100,000 high-quality videos of lectures and talks from YouTube. From these videos, we extracted segments with a clean speech (e.g. no mixed music, audience sounds or other speakers) and with a single speaker visible in the video frames. This resulted in roughly 2000 hours of video clips, each of a single person visible to the camera and talking with no background interference. We then used this clean data to generate 'synthetic cocktail parties' -- mixtures of face videos and their corresponding speech from separate video sources, along with non-speech background noise we obtained from AudioSet."
"Using this data, we were able to train a multi-stream convolutional neural network-based model to split the synthetic cocktail mixture into separate audio streams for each speaker in the video. The input to the network are visual features extracted from the face thumbnails of detected speakers in each frame, and a spectrogram representation of the video's soundtrack. During training, the network learns (separate) encodings for the visual and auditory signals, then it fuses them together to form a joint audio-visual representation. With that joint representation, the network learns to output a time-frequency mask for each speaker. The output masks are multiplied by the noisy input spectrogram and converted back to a time-domain waveform to obtain an isolated, clean speech signal for each speaker. For full details, see our paper."
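The sketch below is a minimal, hypothetical PyTorch rendering of the multi-stream design described above, not Google's implementation. The layer sizes, the use of per-frame face embeddings as the visual input, and the assumption that visual and audio features have been resampled to a common frame rate are illustrative choices only.

```python
# Illustrative multi-stream audio-visual separator: separate audio and visual
# encoders, fusion into a joint representation, a bidirectional LSTM over time,
# and one time-frequency mask per speaker applied to the mixture spectrogram.
import torch
import torch.nn as nn


class AudioVisualSeparator(nn.Module):
    def __init__(self, n_freq_bins=257, face_embed_dim=512, n_speakers=2, hidden=256):
        super().__init__()
        self.n_speakers = n_speakers
        # Audio stream: 2-D convolutions over the mixture spectrogram.
        self.audio_stream = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Visual stream: 1-D convolutions over per-frame face embeddings,
        # applied to each detected speaker's face track.
        self.visual_stream = nn.Sequential(
            nn.Conv1d(face_embed_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Fusion + bidirectional LSTM over time, then per-speaker masks.
        fused_dim = 32 * n_freq_bins + 128 * n_speakers
        self.blstm = nn.LSTM(fused_dim, hidden, batch_first=True, bidirectional=True)
        self.mask_head = nn.Linear(2 * hidden, n_speakers * n_freq_bins)

    def forward(self, spectrogram, face_embeddings):
        # spectrogram: (batch, time, n_freq_bins) -- magnitude of the noisy mixture
        # face_embeddings: (batch, n_speakers, time, face_embed_dim),
        # assumed upsampled to the audio frame rate
        b, t, f = spectrogram.shape
        a = self.audio_stream(spectrogram.unsqueeze(1))      # (b, 32, t, f)
        a = a.permute(0, 2, 1, 3).reshape(b, t, -1)          # (b, t, 32*f)

        v_feats = []
        for s in range(self.n_speakers):
            v = face_embeddings[:, s].transpose(1, 2)        # (b, embed, t)
            v = self.visual_stream(v).transpose(1, 2)        # (b, t, 128)
            v_feats.append(v)

        fused = torch.cat([a] + v_feats, dim=-1)             # joint audio-visual representation
        h, _ = self.blstm(fused)
        masks = torch.sigmoid(self.mask_head(h))             # (b, t, n_speakers*f)
        masks = masks.view(b, t, self.n_speakers, f).permute(0, 2, 1, 3)
        # One masked spectrogram estimate per speaker; an inverse STFT (not shown)
        # would convert each back to a clean time-domain waveform.
        return masks * spectrogram.unsqueeze(1)              # (b, n_speakers, t, f)


# Example usage with hypothetical shapes: a 200-frame mixture and two face tracks.
model = AudioVisualSeparator()
spec = torch.rand(1, 200, 257)
faces = torch.rand(1, 2, 200, 512)
per_speaker_specs = model(spec, faces)                       # (1, 2, 200, 257)
```

In this sketch each output channel corresponds to one detected face, matching the description above of one time-frequency mask per speaker.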
Function
R&D
Core Research And Development
Background
"People are remarkably good at focusing their attention on a particular person in a noisy environment, mentally 'muting' all other voices and sounds. Known as the cocktail party effect, this capability comes natural to us humans. However, automatic speech separation — separating an audio signal into its individual speech sources — while a well-studied problem, remains a significant challenge for computers."
Benefits
Data
"To generate training examples, we started by gathering a large collection of 100,000 high-quality videos of lectures and talks from YouTube. From these videos, we extracted segments with a clean speech (e.g. no mixed music, audience sounds or other speakers) and with a single speaker visible in the video frames. This resulted in roughly 2000 hours of video clips, each of a single person visible to the camera and talking with no background interference. We then used this clean data to generate 'synthetic cocktail parties' -- mixtures of face videos and their corresponding speech from separate video sources, along with non-speech background noise we obtained from AudioSet."