AI Case Study
Netflix optimises image selection for video merchandising using computer vision to segment video into frames with key elements
Netflix is using an AI suite to determine which images to select from a video for use in its digital merchandising. These images are displayed to customers on the Netflix website. The AI uses deep neural networks and computer vision to detect key objects, actors, and general aesthetic qualities that have been identified as high impact.
Consumer Goods And Services
Entertainment And Sports
Netflix optimises its image selection in the following way: "we first came up with objective signals that we can measure for each and every frame of the video using Frame Annotations. As result, we can collect an effective representation of each frame of the video. Subsequently, we created ranking algorithms that allows us to rank a subset of frames that meets aesthetic, creative and diversity objectives to represent content accurately for various canvases of our product."
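The ranking step described above can be sketched roughly as follows. The specific signals, weights, and the one-frame-per-shot diversity rule below are illustrative assumptions, not Netflix's actual algorithm: each frame carries a vector of objective signals gathered from frame annotations, and a greedy pass selects high-scoring frames while enforcing a simple diversity constraint.

```python
# Hypothetical sketch of scoring and ranking annotated frames.
# All signal names and weights are invented for illustration.
from dataclasses import dataclass

@dataclass
class Frame:
    frame_id: int
    brightness: float   # example visual signal in [0, 1]
    sharpness: float    # example visual signal in [0, 1]
    face_count: int     # example contextual signal
    shot_id: int        # camera shot this frame belongs to

def aesthetic_score(f: Frame) -> float:
    # Toy weighted combination of per-frame objective signals.
    return 0.5 * f.sharpness + 0.3 * f.brightness + 0.2 * min(f.face_count, 3) / 3

def rank_frames(frames, k=3):
    """Greedily pick k high-scoring frames, enforcing diversity by
    allowing at most one frame per camera shot."""
    chosen, used_shots = [], set()
    for f in sorted(frames, key=aesthetic_score, reverse=True):
        if f.shot_id not in used_shots:
            chosen.append(f)
            used_shots.add(f.shot_id)
        if len(chosen) == k:
            break
    return chosen
```

In this toy version, "diversity" is reduced to a one-frame-per-shot rule; the quote suggests the production system balances several aesthetic, creative and diversity objectives at once.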
Creative And Brand
Previous research indicated to Netflix that customers look first at the image associated with a video on its website and, based on their reaction, decide whether to seek out additional information about the video. These merchandising stills are frames taken directly from a video or series, so optimising the frame shown to customers can increase the likelihood that they select a title.
"A single season of average TV show (about 10 episodes) contains nearly 9 million total frames. Asking creative editors to efficiently sift through that many frames of video to identify one frame that will capture an audience’s attention is tedious and ineffective. We set out to build a tool that quickly and effectively identifies which frames are the best moments to represent a title on the Netflix service."
In-house: "In order to scale horizontally and have predictable SLA for a growing catalog of content, we utilized the Archer framework to process our videos more efficiently. Archer allowed us to split the videos into smaller sized chunks that could each be processed in parallel. This has enabled us to scale by lending efficiency to our video processing pipelines, and allowing us to integrate more and more content intelligence algorithms into our tool sets.

...Every frame of video in a piece of content is processed through a series of computer vision algorithms to gather objective frame metadata, latent representation of frame, as well as some of the contextual metadata that those frame(s) contain. The annotation properties that we process and apply to our video frames can be roughly grouped into 3 main categories: Visual Metadata, Contextual Metadata (face detection, motion estimation, camera shot identification, object detection), and Composition Metadata.

After we’ve processed and annotated every frame in a given video, the next step is to surface “the best” image candidates from those frames through an automated artwork pipeline. That way, when our creative teams are ready to begin work for a piece of content, they are automatically provided with a high quality image set to choose from.

Actors play a very important role in artwork. One way we identify the key character for a given episode is by utilizing a combination of face clustering and actor recognition to prioritize main characters and de-prioritize secondary characters or extras. To accomplish this, we trained a deep-learning model to trace facial similarities from all qualifying candidate frames tagged with frame annotation to surface and rank the main actors of a given title without knowing anything about the cast members."
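The chunk-and-parallelise pattern the quote attributes to Archer can be sketched as below. The Archer framework itself is internal to Netflix, so the chunk size, the stand-in `annotate_frame` function, and the use of a thread pool (a real pipeline would fan out across processes or machines) are all assumptions for illustration.

```python
# Minimal sketch: split a video's frame range into independent chunks,
# annotate each chunk in parallel, then merge results back in frame order.
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1000  # frames per chunk; chosen arbitrarily for illustration

def annotate_frame(frame_index):
    # Stand-in for the computer-vision annotators (face detection,
    # motion estimation, camera shot identification, object detection).
    return {"frame": frame_index, "annotations": {}}

def annotate_chunk(chunk):
    start, end = chunk
    return [annotate_frame(i) for i in range(start, end)]

def annotate_video(total_frames, chunk_size=CHUNK_SIZE):
    chunks = [(s, min(s + chunk_size, total_frames))
              for s in range(0, total_frames, chunk_size)]
    # Chunks are independent, so they can be processed concurrently;
    # map() preserves chunk order, so the merge is a simple flatten.
    with ThreadPoolExecutor() as pool:
        results = pool.map(annotate_chunk, chunks)
    return [ann for chunk_result in results for ann in chunk_result]
```

The key property the quote highlights is that chunks are independent units of work, which is what lets the pipeline scale horizontally as the catalog grows.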
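The face-clustering idea for surfacing main characters can be illustrated as follows. The embedding vectors, the distance threshold, and the greedy clustering scheme below are all assumptions; the quoted system uses a trained deep-learning model to produce facial similarity representations, and the heuristic here simply ranks clusters by size, so that faces appearing most often (main characters) surface first without any cast-member labels.

```python
# Toy sketch: cluster face embeddings by distance, rank clusters by size.
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_faces(embeddings, threshold=0.5):
    """Greedy clustering: assign each face embedding to the first cluster
    whose centroid is within `threshold`, else start a new cluster."""
    clusters = []  # each cluster is a list of embeddings
    for emb in embeddings:
        for cluster in clusters:
            centroid = [sum(dim) / len(cluster) for dim in zip(*cluster)]
            if distance(emb, centroid) < threshold:
                cluster.append(emb)
                break
        else:
            clusters.append([emb])
    return clusters

def rank_characters(embeddings, threshold=0.5):
    """Larger clusters = faces seen more often = likely main characters."""
    return sorted(cluster_faces(embeddings, threshold), key=len, reverse=True)
```

This captures the core insight in the quote: character importance can be inferred from facial similarity and frequency alone, "without knowing anything about the cast members".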