AI Case Study

Disney researchers identify, track and predict movie audience enjoyment using facial recognition

Disney and affiliated university researchers introduce and test a method for detecting audience members' facial emotions while they watch a film. After about ten minutes of observation, the model could both identify emotions and predict the future emotional state of individual audience members. The potential benefit is improved marketing research.


Consumer Goods And Services

Entertainment And Sports

Project Overview

From Gizmodo: "At IEEE’s Computer Vision and Pattern Recognition last weekend, Disney Research and Caltech explained their technique for tracking the facial expressions of people watching movies. The research team calls their new algorithm “factorized variational autoencoders” (FVAEs). They claim the technology is so effective at recognizing complex expressions that, after analyzing a single audience member’s face for about ten minutes, it can even predict that face’s future expressions throughout the remainder of a film."

Reported Results

From a technical standpoint, the FVAEs were demonstrated to "reliably predict that viewer’s facial expressions for the remainder of the movie. Furthermore, FVAEs were able to learn concepts of smiling and laughing, and that these signals correlate with humorous scenes in a movie". (IEEE paper)


"[T]he factorized variational autoencoder takes images of the faces of people watching movies and breaks them down into a series of numbers representing specific features: one number for how much a face is smiling, another for how wide open the eyes are, etc. Metadata then allow the algorithm to connect those numbers with other relevant bits of data—for example, with other images of the same face taken at different points in time, or of other faces at the same point in time. With enough information, the system can assess how an audience is reacting to a movie so accurately that it can predict an individual's responses based on just a few minutes of observation. The pattern recognition technique is not limited to faces. It can be used on any time-series data collected from a group of objects." (Caltech)
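The core of the approach described above is recovering the missing or future entries of a (viewers × time) grid of reaction measurements from low-rank structure. A minimal numpy sketch of that idea follows, using synthetic "smile intensity" data and simple iterative SVD imputation as a crude stand-in for the paper's tensor factorization; all names, dimensions, and data here are illustrative, not Disney's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "smile intensity" signals for 50 viewers over 200 time steps,
# generated from a low-rank structure plus noise (illustrative data only).
n_viewers, n_steps, rank = 50, 200, 3
U_true = rng.normal(size=(n_viewers, rank))   # per-viewer factors
V_true = rng.normal(size=(n_steps, rank))     # per-time-step factors
signals = U_true @ V_true.T + 0.1 * rng.normal(size=(n_viewers, n_steps))

# Hide ~30% of the entries, as if some reactions were unobserved.
mask = rng.random(signals.shape) > 0.3        # True = observed

# Iterative low-rank imputation ("hard-impute"): repeatedly fit a rank-3
# SVD approximation and refill the hidden entries with it.
est = np.where(mask, signals, 0.0)
for _ in range(20):
    U, s, Vt = np.linalg.svd(est, full_matrices=False)
    low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    est = np.where(mask, signals, low_rank)

# Reconstruction error on the entries that were hidden.
rmse = np.sqrt(np.mean((low_rank - signals)[~mask] ** 2))
print(f"held-out RMSE: {rmse:.3f}")
```

Because the synthetic data really is low-rank, the held-out entries are recovered down to roughly the noise level, which is the property that makes predicting unobserved reactions possible at all.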

"We trained an MMOD face detector using the implementation in DLib and manually labeled 800 training images. Due to the difference in resolution between the front and back rows of the theater, we created two face detection models: one for seats in the last three rows, and one for the rest of the theater... the FVAE applies a non-linear variant of tensor factorization using deep variational autoencoders to learn a latent representation that factors linearly. Our formulation combines the compactness and interpretability of VAEs with the generalization performance of TF [tensor factorization]." (IEEE paper)
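The "latent representation that factors linearly" can be pictured as a per-viewer factor combined with a per-time-step factor, pushed through a shared nonlinear decoder. The sketch below illustrates only that structural idea with fixed random weights in place of a trained VAE decoder; it is an assumption-laden toy, not the paper's network.

```python
import numpy as np

rng = np.random.default_rng(1)

n_viewers, n_steps, latent_dim, n_features = 8, 30, 4, 16

# Per-viewer and per-time-step factors; their elementwise product gives the
# latent code z[i, t] -- the "factors linearly" property described above.
U = rng.normal(size=(n_viewers, latent_dim))
V = rng.normal(size=(n_steps, latent_dim))
Z = U[:, None, :] * V[None, :, :]          # shape: (viewers, time, latent)

# Fixed random weights standing in for the trained nonlinear VAE decoder.
W1 = rng.normal(size=(latent_dim, 32))
W2 = rng.normal(size=(32, n_features))

def decode(z):
    """Map a latent code to a facial-feature vector (smile, eye openness, ...)."""
    return np.tanh(z @ W1) @ W2

faces = decode(Z)
print(faces.shape)   # one feature vector per viewer per frame
```

The design point is that all nonlinearity lives in the shared decoder, so the latent space itself stays linear and can be completed with standard factorization machinery.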

For the prediction of future emotion "we train an FVAE model on 80% of the audience members, and use the remaining 20% to test long term predictions... we see that the prediction error drops quickly and saturates after observing the first 10% of data. Moreover, the long term prediction error is consistent with the testing error from matrix completion. Intuitively, the fact that prediction error saturates after observing audience reactions for 10% of the movie agrees with established guidelines that a film has roughly ten minutes to pull the audience into the story." (IEEE paper)
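The evaluation above can be sketched in the same linear toy setting: treat the time factors as "learned" from the training audience, observe a held-out viewer for the first 10% of the movie, fit that viewer's latent factor by least squares, and predict their remaining reactions. This is a simplified stand-in for the paper's FVAE protocol; the data, split, and dimensions are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)

n_viewers, n_steps, rank = 100, 200, 3
U = rng.normal(size=(n_viewers, rank))
V = rng.normal(size=(n_steps, rank))
reactions = U @ V.T + 0.05 * rng.normal(size=(n_viewers, n_steps))

# Pretend the per-time-step factors V were learned from 80% of the
# audience (here we just reuse the generating factors as a stand-in).
test_viewer = reactions[-1]                 # a held-out audience member

# Observe only the first 10% of the movie for this viewer...
observed = int(0.1 * n_steps)
u_hat, *_ = np.linalg.lstsq(V[:observed], test_viewer[:observed], rcond=None)

# ...then predict the remaining 90% of their reactions.
predicted = V[observed:] @ u_hat
rmse = np.sqrt(np.mean((predicted - test_viewer[observed:]) ** 2))
print(f"long-term prediction RMSE: {rmse:.3f}")
```

With only a handful of latent dimensions to estimate, a short observation window already pins down the viewer's factor, which mirrors the paper's finding that prediction error saturates after seeing about 10% of the film.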



Marketing Research Planning


The Walt Disney Company is "the world’s second-largest media conglomerate" and produces films among other media.



The facial training set included 800 labelled images. "In order to build a dataset of millions of facial landmarks to feed into a neural network, researchers used infrared cameras to film the audiences of 150 showings of nine movies, including recent Disney films." (Gizmodo)