AI Case Study

Apple iOS 10 creates themed movies from user photos using on-device facial recognition while keeping latency low

Apple has introduced a new user media management application for devices running its iOS 10 system. The app, called "Memories", uses facial recognition and other AI to automatically create thematic mini movies with soundtracks, all processed on the user's device. This is achieved while keeping computational resource requirements low so as to avoid user inconvenience.



Software and IT Services

Project Overview

The app operates on Apple iPhone and iPad products. From TechCrunch: "Using local, on-device facial recognition and AI detection of what’s in your images, Memories can combine photos and videos into themed mini-movies complete with transitions and a soundtrack." Users can "adjust settings to select different themes like chill, gentle, or uplifting, and a short, medium, or long length", and upload and share the resulting movies on various platforms. Memories are created and stored automatically on the device itself rather than by a third-party cloud-based service, which emphasises user data privacy.

Reported Results

Results undisclosed. However, because the app cannot distinguish pleasant memories from unpleasant (or simply mundane) ones, some users have complained that automatically generated Memories surface content they would rather not see. User satisfaction in general may be difficult to control.

From a technical standpoint, with regard to running the application on user devices, Apple claims its AI work has ensured "that our users can enjoy local, low-latency, private deep learning inference without being aware that their phone is running neural networks at several hundreds of gigaflops per second".


According to Apple: "We built our initial architecture based on some of the insights from the OverFeat paper, resulting in a fully convolutional network with a multitask objective comprising of:

* a binary classification to predict the presence or absence of a face in the input, and
* a regression to predict the bounding box parameters that best localized the face in the input.

...The challenge then was how to train a simple and compact network that could mimic the behavior of the accurate but highly complex networks. We decided to leverage an approach, informally called “teacher-student” training[4]. This approach provided us a mechanism to train a second thin-and-deep network (the “student”), in such a way that it matched very closely the outputs of the big complex network (the “teacher”) that we had trained as described previously. The student network was composed of a simple repeating structure of 3x3 convolutions and pooling layers and its architecture was heavily tailored to best leverage our neural network inference engine. Now, finally, we had an algorithm for a deep neural network for face detection that was feasible for on-device execution. We iterated through several rounds of training to obtain a network model that was accurate enough to enable the desired applications.

The joy of ease-of-use would quickly dissipate if our face detection API were not able to be used both in real time apps and in background system processes... To reduce memory footprint, we allocate the intermediate layers of our neural networks by analyzing the compute graph. For Vision, the detector runs 5 networks (one for each image pyramid scale as shown in Figure 2). These 5 networks share the same weights and parameters, but have different shapes for their input, output, and intermediate layers. To reduce footprint even further, we run the liveness-based memory optimization algorithm on the joint graph composed by those 5 networks, significantly reducing the footprint... To achieve better performance, we exploit the fully convolutional nature of the network: All the scales are dynamically resized to match the resolution of the input image."
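Apple names a "liveness-based memory optimization algorithm" in the passage above but does not describe how it works. As a rough illustration of the general idea only, the sketch below reuses a buffer once the intermediate result that occupied it is no longer needed. The `Tensor` fields, the `plan_buffers` function, the layer names and the sizes are all hypothetical, and the greedy strategy is an assumption rather than Apple's method.

```python
from collections import namedtuple

# Hypothetical description of one intermediate result in a compute graph:
# size in bytes, index of the op that writes it, index of the last op that reads it.
Tensor = namedtuple("Tensor", ["name", "size", "first_use", "last_use"])

def plan_buffers(tensors):
    """Greedily share buffers between tensors whose lifetimes do not overlap.

    Returns (assignment, total_bytes), where assignment maps each tensor
    name to a buffer index. This is a generic sketch, not Apple's algorithm.
    """
    buffers = []      # each entry is [capacity, busy_until]
    assignment = {}
    for t in sorted(tensors, key=lambda t: t.first_use):
        # A buffer is free if its previous tenant was last read before t is written.
        free = [i for i, (_, busy_until) in enumerate(buffers)
                if busy_until < t.first_use]
        if free:
            # Prefer the largest free buffer so that it rarely has to grow.
            i = max(free, key=lambda i: buffers[i][0])
            buffers[i][0] = max(buffers[i][0], t.size)
            buffers[i][1] = t.last_use
        else:
            buffers.append([t.size, t.last_use])
            i = len(buffers) - 1
        assignment[t.name] = i
    return assignment, sum(capacity for capacity, _ in buffers)

# A simple chain of layers: each output is written by op k and read by op k + 1.
layers = [
    Tensor("conv1", 400, 0, 1),
    Tensor("pool1", 100, 1, 2),
    Tensor("conv2", 200, 2, 3),
    Tensor("pool2", 50, 3, 4),
]
assignment, total = plan_buffers(layers)
naive_total = sum(t.size for t in layers)
print(assignment)          # {'conv1': 0, 'pool1': 1, 'conv2': 0, 'pool2': 1}
print(total, naive_total)  # 500 750
```

In this toy chain two buffers alternate between layers, so the plan needs 500 bytes instead of the 750 bytes required if every layer kept its own allocation. Running the same planner over a joint graph of several networks, as the quote describes for the five pyramid scales, would let buffers be shared across networks as well.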


R and D

Product Development


Apple has introduced its operating system-native app "Memories" to compete with similar products, such as the third-party "Timehop" and Facebook's "On This Day". Additional media features such as this may help retain smartphone and tablet customers against competitors such as Android devices. The difficulty with introducing such a program, which runs discreetly in the background on a user's device, is that it can be memory-intensive and, as a consequence, battery-intensive.



According to Apple: "We experimented with several ways of training such a network. For example, a simple procedure for training is to create a large dataset of image tiles of a fixed size corresponding to the smallest valid input to the network such that each tile produces a single output from the network. The training dataset is ideally balanced, so that half of the tiles contain a face (positive class) and the other half do not contain a face (negative class). For each positive tile, we provide the true location (x, y, w, h) of the face. We train the network to optimize the multitask objective described previously. Once trained, the network is able to predict whether a tile contains a face, and if so, it also provides the coordinates and scale of the face in the tile."
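The multitask objective and tile-based training that Apple describes can be illustrated with a small loss calculation. The sketch below is a minimal example under stated assumptions: the quote does not say which loss functions were used, so binary cross-entropy for the face/no-face output and mean squared error for the (x, y, w, h) box are assumptions, as are the `box_weight` parameter and all function names.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def multitask_loss(face_logit, pred_box, is_face, true_box=None, box_weight=1.0):
    """Loss for one training tile.

    Classification term: binary cross-entropy on face presence (assumed).
    Regression term: mean squared error on (x, y, w, h), applied only to
    positive tiles, since negative tiles have no true box (assumed).
    """
    eps = 1e-12  # guards against log(0)
    p = sigmoid(face_logit)
    cls = -math.log(p + eps) if is_face else -math.log(1.0 - p + eps)
    if not is_face:
        return cls
    reg = sum((a - b) ** 2 for a, b in zip(pred_box, true_box)) / 4.0
    return cls + box_weight * reg

def batch_loss(batch):
    """Average loss over a batch of tiles given as keyword dictionaries."""
    return sum(multitask_loss(**tile) for tile in batch) / len(batch)

# A balanced toy batch: one tile with a face, one without.
batch = [
    {"face_logit": 2.0, "pred_box": (0.4, 0.5, 0.2, 0.2),
     "is_face": True, "true_box": (0.5, 0.5, 0.2, 0.2)},
    {"face_logit": -1.5, "pred_box": (0.0, 0.0, 0.0, 0.0), "is_face": False},
]
print(batch_loss(batch))
```

Skipping the regression term for negative tiles is one common design choice: the network is then never penalised for the box it outputs when no face is present.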