AI Case Study

Affectiva classifies the emotion of anger from audio data in 1.2 seconds using a convolutional neural network

Affectiva has developed a convolutional neural network, SoundNet, to identify anger in audio data. The network was first trained on audio from two million videos and then fine-tuned on IEMOCAP, a 12-hour dataset of annotated audiovisual emotion data including video, speech, and text transcriptions. It can classify anger from as little as 1.2 seconds of audio, regardless of the speaker’s language.

Industry

Project Overview

MIT Media Lab spinoff Affectiva’s neural network, SoundNet, can classify anger from audio data in as little as 1.2 seconds regardless of the speaker’s language, just over the time it takes for humans to perceive anger.

SoundNet consists of a convolutional neural network — a type of neural network commonly applied to analyzing visual imagery — trained on a video dataset. To get it to recognize anger in speech, the team first sourced a large amount of general audio data — two million videos, or just over a year’s worth — with ground truth produced by another model. Then, they fine-tuned it with a smaller dataset, IEMOCAP, containing 12 hours of annotated audiovisual emotion data including video, speech, and text transcriptions.

To test the AI model’s generalizability, the team evaluated its English-trained model on Mandarin Chinese speech emotion data (the Mandarin Affective Speech Corpus, or MASC). They report that it not only generalized well to English speech data, but that it was effective on the Chinese data — albeit with a slight degradation in performance.

The researchers say that their success proves an “effective” and “low-latency” speech emotion recognition model can be significantly improved with transfer learning, a technique that leverages AI systems trained on a large dataset of previously annotated samples to bootstrap training in a new domain with sparse data — in this case, an AI system trained to classify general sounds.
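The transfer-learning idea described above can be illustrated with a minimal sketch: a feature extractor standing in for the pretrained network is kept frozen, and only a small classification head is trained on a handful of labeled examples. Everything here is an assumption for illustration, not Affectiva's implementation: the extractor is a fixed random projection rather than a real pretrained model, the data is synthetic, and the names `make_frozen_extractor`, `train_head`, `predict`, and `demo` are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_frozen_extractor(in_dim, out_dim, seed=0):
    """Stand-in for a pretrained network: a fixed projection plus ReLU.

    Assumption: a real system would load learned weights here. These
    weights are random and are never updated during fine-tuning.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.gauss(0, scale) for _ in range(in_dim)] for _ in range(out_dim)]

    def extract(x):
        return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

    return extract


def train_head(features, labels, epochs=100, lr=0.05):
    """Fit a logistic-regression head on frozen features with plain SGD."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
            g = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b


def predict(w, b, f):
    """Return 1 (e.g. "anger") or 0 (e.g. "not anger") for one feature vector."""
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0


def demo(seed=1):
    """Train the head on a small synthetic labeled set; return training accuracy."""
    rng = random.Random(seed)
    extract = make_frozen_extractor(in_dim=4, out_dim=8, seed=0)
    inputs, labels = [], []
    for _ in range(20):
        inputs.append([1.0 + rng.gauss(0, 0.3) for _ in range(4)])
        labels.append(1)
        inputs.append([-1.0 + rng.gauss(0, 0.3) for _ in range(4)])
        labels.append(0)
    features = [extract(x) for x in inputs]  # the extractor stays frozen
    w, b = train_head(features, labels)
    correct = sum(predict(w, b, f) == y for f, y in zip(features, labels))
    return correct / len(labels)


if __name__ == "__main__":
    print("training accuracy:", demo())
```

Only the head's weights change during training, which is why a small labeled set can suffice. Fine-tuning can also update some or all of the pretrained layers, and the article does not say which layers were updated in this case.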

“This result is promising because while emotion speech datasets are small and expensive to obtain, massive datasets for natural sound events are available, such as the dataset used to train SoundNet or Google’s AudioSet. These two datasets alone have about 15 thousand hours of labeled audio data,” the team wrote. “[Anger classification] has many useful applications, including conversational interfaces and social robots, interactive voice response (IVR) systems, market research, customer agent assessment and training, and virtual and augmented reality.”

They leave to future work tapping other large publicly available corpora, and training AI systems for related speech-based tasks, such as recognizing other types of emotions and affective states.

Reported Results

The system "can classify anger from audio data in as little as 1.2 seconds regardless of the speaker’s language, just over the time it takes for humans to perceive anger."
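To make the 1.2-second figure concrete, the sketch below cuts an audio signal into consecutive 1.2-second windows and labels each one independently, so a decision is available as soon as each window is complete. The 16 kHz sample rate, the function names, and the amplitude-threshold classifier in the usage example are hypothetical placeholders; the actual model and its input format are not described in this article.

```python
def split_into_windows(samples, sample_rate_hz, window_seconds=1.2):
    """Split a sequence of samples into non-overlapping fixed-length windows.

    A trailing partial window is dropped, since a fixed-input classifier
    could not score it without padding.
    """
    window_len = int(round(sample_rate_hz * window_seconds))
    if window_len <= 0:
        raise ValueError("window length must be at least one sample")
    last_start = len(samples) - window_len
    return [samples[i:i + window_len] for i in range(0, last_start + 1, window_len)]


def classify_stream(samples, sample_rate_hz, classify_window, window_seconds=1.2):
    """Apply a per-window classifier and return (window_index, label) pairs.

    `classify_window` is any callable that maps one window to a label; it
    stands in for the trained network.
    """
    windows = split_into_windows(samples, sample_rate_hz, window_seconds)
    return [(i, classify_window(w)) for i, w in enumerate(windows)]


if __name__ == "__main__":
    rate = 16000  # assumed sample rate, not taken from the article
    quiet = [0.01] * (2 * rate)
    loud = [0.9] * (2 * rate)

    # Placeholder classifier: a simple amplitude threshold, not an emotion model.
    def toy_classifier(window):
        return "anger" if max(abs(s) for s in window) > 0.5 else "other"

    print(classify_stream(quiet + loud, rate, toy_classifier))
```

Non-overlapping windows keep the example short. A deployed system might use overlapping windows to react faster to a change in tone, at the cost of more computation.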

Technology

Function

Background

“[A] significant problem in harnessing the power of deep learning networks for emotion recognition is the mismatch between a large amount of data required by deep networks and the small size of emotion-labeled speech datasets,” the paper’s coauthors wrote. “[O]ur trained anger detection model improves performance and generalizes well on a variety of acted, elicited, and natural emotional speech datasets. Furthermore, our proposed system has low latency suitable for real-time applications.”

Benefits

Data

"To get [SoundNet] to recognize anger in speech, the team first sourced a large amount of general audio data (two million videos, or just over a year’s worth) with ground truth produced by another model. Then, they fine-tuned it with a smaller dataset, IEMOCAP, containing 12 hours of annotated audiovisual emotion data including video, speech, and text transcriptions."