
AI Case Study

Yelp optimises visual interaction with users on its web pages by using deep learning to automatically select cover photos and classify the millions of photos uploaded by users

Yelp has developed a supervised CNN classifier to automatically classify the millions of business photos uploaded by users. By analysing user behaviour, they found that users prefer a cover made up of one food photo and one non-food photo. New images are also automatically classified into groups for display.



Internet Services Consumer

Project Overview

Yelp wanted to make photo browsing easier for its users and work out how best to use photos to capture user attention. They developed a CNN-based classifier that automatically adds semantic labels to user-uploaded photographs, enabling a tabbed view. They also found that users interact most when the cover shows two pictures: one of food and one of the ambience.

"To help simplify the problem, they focused initially on only sorting photos into a handful of predefined classes. They gathered the labels via crowdsourcing, from captions and from user-updated attributes. Once the training data was ready, a CNN in the form of AlexNet was deployed to recognize the classes."

Photos for each business are classified and shown in different tabs on the page. The algorithm also picks the pictures to display on the cover, with the aim of improving user interaction with the photos.
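The tabbed view and cover selection described above can be sketched in a few lines of Python. This is only an illustration, not Yelp's implementation: the photo records, the label names ("food", "inside", "outside"), the `score` field and both function names are assumptions made for the example.

```python
from collections import defaultdict


def group_into_tabs(photos):
    """Group classified photos by their predicted label, one list per tab."""
    tabs = defaultdict(list)
    for photo in photos:
        tabs[photo["label"]].append(photo)
    return dict(tabs)


def pick_cover(photos):
    """Return (best food photo, best non-food photo); either may be None.

    'Best' here simply means the highest classifier score, which is an
    assumption for this sketch rather than a documented ranking rule.
    """
    food = [p for p in photos if p["label"] == "food"]
    other = [p for p in photos if p["label"] != "food"]

    def best(group):
        return max(group, key=lambda p: p["score"]) if group else None

    return best(food), best(other)


# Hypothetical classifier output for one business.
photos = [
    {"id": "p1", "label": "food", "score": 0.81},
    {"id": "p2", "label": "inside", "score": 0.93},
    {"id": "p3", "label": "food", "score": 0.97},
    {"id": "p4", "label": "outside", "score": 0.88},
]

tabs = group_into_tabs(photos)
cover_food, cover_other = pick_cover(photos)
print(sorted(tabs))                         # tab names
print(cover_food["id"], cover_other["id"])  # chosen cover pair
```

Picking by score keeps the sketch short; a production system would presumably also weigh photo quality and engagement signals.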

Reported Results

On an evenly split gold test set of 2,500 photos, Yelp's classifier achieved an overall precision of 94% and a recall of 70%.
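For readers less familiar with the two metrics, the sketch below shows how precision and recall are computed for one class. The labels and predictions are made-up toy data, not Yelp's test set, and the `None` prediction (a photo the classifier declines to label) is an assumption used to show how precision can stay high while recall drops.

```python
def precision_recall(true_labels, predicted_labels, positive):
    """Compute precision and recall for a single class.

    precision = TP / (TP + FP): of the photos predicted as `positive`,
                                the share that really are.
    recall    = TP / (TP + FN): of the photos that really are `positive`,
                                the share the classifier found.
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive and truth != positive:
            fp += 1
        elif pred != positive and truth == positive:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data: two of the four food photos are left unlabeled (None).
truth = ["food", "food", "food", "food", "inside"]
preds = ["food", "food", None, None, "inside"]

precision, recall = precision_recall(truth, preds, positive="food")
print(precision, recall)  # 1.0 0.5
```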

Yelp found that a cover with one food photo and one non-food photo maximises user engagement.


"Once we had our labeled data, we employed deep convolutional neural networks (CNNs) in the form of “AlexNet” to recognize those classes. CNNs usually consist of a deep stack of multiple convolutional layers (for extracting spatially local and translation-invariant features), ReLU (Rectified Linear Units) layers (for non-saturating activations), pooling layers (for down-sampling and translation-invariance), local response normalization layers (for better generalization) and fully-connected layers as in conventional feedforward neural networks. Softmax outputs and regularization methods such as dropout are also commonly used. Our CNN was built on AWS EC2 GPU instances based on the Caffe framework.

Baseline is a “Caffe Classifier” that runs the CNN by means of Caffe; it’s a special form of an abstract classifier that can take different signals and perform different classification algorithms. Our current “facade” classifier is an ensemble that takes the weighted average of classification results from two independently trained Caffe Classifiers."
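The ensemble step in the quote above, averaging the outputs of two independently trained classifiers, might look roughly like the sketch below. The class names, the scores and the 0.5 weight are illustrative assumptions; the source does not state the weights Yelp used or how the two classifiers expose their results.

```python
def weighted_average(scores_a, scores_b, weight_a=0.5):
    """Combine two per-class score dicts into one weighted average.

    `weight_a` is the weight given to the first classifier; the second
    classifier gets 1 - weight_a. Both dicts must cover the same classes.
    """
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")
    if scores_a.keys() != scores_b.keys():
        raise ValueError("both classifiers must score the same classes")
    weight_b = 1.0 - weight_a
    return {
        label: weight_a * scores_a[label] + weight_b * scores_b[label]
        for label in scores_a
    }


def predict(scores_a, scores_b, weight_a=0.5):
    """Return the class with the highest combined score."""
    combined = weighted_average(scores_a, scores_b, weight_a)
    return max(combined, key=combined.get)


# Hypothetical softmax outputs from two classifiers for one photo.
classifier_a = {"food": 0.6, "inside": 0.4}
classifier_b = {"food": 0.2, "inside": 0.8}

print(predict(classifier_a, classifier_b))  # inside
```

With equal weights the combined scores are about 0.4 for "food" and 0.6 for "inside", so the second classifier's stronger opinion wins even though the first one preferred "food".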



Digital Marketing


Yelp, a crowd-sourced review forum, hosts millions of photos uploaded by users.



Photo captions, Yelp’s menu structures, Photo attributes
