
AI Case Study

Researchers from MIT train AI to predict recipe ingredients by looking at a photo of food using machine vision and deep learning

Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), in collaboration with the Qatar Computing Research Institute (QCRI) and a researcher from the Universitat Politècnica de Catalunya in Spain, trained a system called Pic2Recipe to look at a photo of food and predict the ingredients of its recipe. The system uses neural embedding models with semantic regularization. In testing it showed promising results, particularly in identifying the ingredients of desserts, but it struggled with more ambiguous foods such as smoothies.


Public And Social Sector

Education And Academia

Project Overview

"The CSAIL team’s project aims to build off of this work but dramatically expand in scope. Researchers combed websites like All Recipes and to develop “Recipe1M,” a database of over 1 million recipes that were annotated with information about the ingredients in a wide range of dishes. They then used that data to train a neural network to find patterns and make connections between the food images and the corresponding ingredients and recipes.

Given a photo of a food item, Pic2Recipe could identify ingredients like flour, eggs, and butter, and then suggest several recipes that it determined to be similar to images from the database."
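The retrieval step described above can be illustrated with a small sketch. It assumes that a trained model has already mapped both the photo and every recipe into the same vector space, and that candidates are ranked by cosine similarity. The function names, the toy three-dimensional vectors, and the recipe labels are hypothetical and are not taken from the Pic2Recipe code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def retrieve_recipes(image_embedding, recipe_embeddings, k=3):
    """Return the ids of the k recipes whose embeddings are closest to the image.

    recipe_embeddings maps a recipe id to its embedding vector (assumed to be
    produced by the same trained model as image_embedding).
    """
    scored = [
        (cosine_similarity(image_embedding, vector), recipe_id)
        for recipe_id, vector in recipe_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [recipe_id for _, recipe_id in scored[:k]]


# Toy embeddings, invented for illustration only.
recipes = {
    "cookies": [0.9, 0.1, 0.0],
    "muffins": [0.8, 0.3, 0.1],
    "sushi": [0.0, 0.2, 0.9],
}
photo = [1.0, 0.1, 0.0]
top_two = retrieve_recipes(photo, recipes, k=2)
```

In this toy setup the photo vector lies closest to the two dessert vectors, so those are returned first. A real system would learn the embeddings from data and would search over roughly a million recipes rather than three.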

Reported Results

When tested, "the system did particularly well with desserts like cookies or muffins, since that was a main theme in the database. However, it had difficulty determining ingredients for more ambiguous foods, like sushi rolls and smoothies."


Neural embedding models with semantic regularization


R And D

Core Research And Development


"In 2014 Swiss researchers created the “Food-101” dataset and used it to develop an algorithm that could recognize images of food with 50 percent accuracy. Future iterations only improved accuracy to about 80 percent, suggesting that the size of the dataset may be a limiting factor.
Even the larger datasets have often been somewhat limited in how well they generalize across populations. A database from the City University in Hong Kong has over 110,000 images and 65,000 recipes, each with ingredient lists and instructions, but only contains Chinese cuisine."



"The recipes were scraped from over two dozen popular cooking websites and processed through a pipeline that extracted relevant text from the raw HTML, downloaded linked images, and assembled the data into a compact JSON schema in which each datum was uniquely identified. As part of the extraction process, excessive whitespace, HTML entities, and non-ASCII characters were removed from the recipe text.

The contents of the Recipe1M dataset may logically be grouped into two layers. The first contains basic information including title, a list of ingredients, and a sequence of instructions for preparing the dish; all of these data are provided as free text. The second layer builds upon the first and includes any images with which the recipe is associated–these are provided as RGB in JPEG format. Additionally, a subset of recipes are annotated with course labels (e.g., appetizer, side dish, dessert)."
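The cleaning step and the two-layer layout described above might look roughly like the following sketch. The field names, the `clean_text` and `build_record` helpers, and the use of a truncated SHA-1 hash of the source URL as the unique identifier are all assumptions made for illustration. The quoted description does not specify the actual Recipe1M schema or pipeline code.

```python
import hashlib
import json
import re
from dataclasses import asdict, dataclass, field
from typing import List, Optional

TAG_RE = re.compile(r"<[^>]+>")
ENTITY_RE = re.compile(r"&(?:#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def clean_text(raw: str) -> str:
    """Strip markup, HTML entities, non-ASCII characters and extra whitespace."""
    text = TAG_RE.sub(" ", raw)                             # drop HTML tags
    text = ENTITY_RE.sub(" ", text)                         # remove entities such as &nbsp;
    text = text.encode("ascii", "ignore").decode("ascii")   # drop non-ASCII characters
    return re.sub(r"\s+", " ", text).strip()                # collapse whitespace


@dataclass
class Recipe:
    # Layer 1: free-text fields
    recipe_id: str
    title: str
    ingredients: List[str]
    instructions: List[str]
    # Layer 2: associated images, plus an optional course label
    image_paths: List[str] = field(default_factory=list)
    course: Optional[str] = None


def build_record(url, title, ingredients, instructions, image_paths=(), course=None):
    """Assemble one cleaned record; the id scheme is a hypothetical choice."""
    recipe_id = hashlib.sha1(url.encode("utf-8")).hexdigest()[:10]
    return Recipe(
        recipe_id=recipe_id,
        title=clean_text(title),
        ingredients=[clean_text(item) for item in ingredients],
        instructions=[clean_text(step) for step in instructions],
        image_paths=list(image_paths),
        course=course,
    )


record = build_record(
    "https://example.com/recipes/1",          # placeholder URL
    " Choc&nbsp;Chip  Cookies ",
    ["<li>2 cups   flour</li>", "1 cup caf\u00e9 sugar"],
    ["Mix.", "Bake."],
    course="dessert",
)
record_json = json.dumps(asdict(record))
```

Keeping the image list and the course label optional mirrors the description: every recipe has the first-layer text fields, while images and course labels are present only for some recipes.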
