AI Case Study

Indix automates and standardises product information retrieved from a variety of sources using machine learning

Indix uses machine learning techniques to sort through product data collected from non-standardised and disparate sources for inclusion in its product catalogue.



Internet Services Consumer

Project Overview

"The Indix Cloud Catalog is a comprehensive collection of structured product information." Indix collects product data by crawling websites or by receiving feeds directly from providers. Once collected, the data must be cleaned and the products classified so that catalogue entries are standardised, because the sources are disparate and many records are incomplete. This categorisation is automated using machine learning, across 23 top-level categories and more than 7,000 leaf-level categories.

Reported Results

The use of machine learning allows Indix to create a standardised catalogue where product data is displayed consistently and automatically.


Offline prediction is used to categorise existing products, and online prediction is used for new products entering the system. The final leaf category is decided by combining several different predictions. The models are linear support vector machines trained with a one-vs-rest scheme.
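To illustrate what one-vs-rest training of linear classifiers looks like, here is a minimal sketch in plain Python. It is not Indix's implementation: the titles, labels, function names and hyperparameters are all invented for illustration, and a simple hinge-loss subgradient descent with an L2 penalty stands in for a production linear SVM solver. One binary "this category vs the rest" model is trained per category, and the category whose model scores highest wins.

```python
from collections import Counter

def featurize(title):
    """Bag-of-words counts from a lower-cased, whitespace-split title."""
    return Counter(title.lower().split())

def score(weights, bias, features):
    """Linear decision value w.x + b; unseen tokens contribute nothing."""
    return bias + sum(weights.get(tok, 0.0) * cnt for tok, cnt in features.items())

def train_binary(samples, targets, epochs=50, lr=0.1, lam=0.01):
    """Hinge-loss subgradient descent with L2 shrinkage for one 'class vs rest' task.

    epochs, lr and lam are illustrative values, not tuned settings.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, y in zip(samples, targets):
            margin = y * score(weights, bias, features)
            # L2 regularisation: shrink every weight slightly on each step.
            for tok in weights:
                weights[tok] *= 1.0 - lr * lam
            # Hinge loss is active only when the margin is below 1.
            if margin < 1.0:
                for tok, cnt in features.items():
                    weights[tok] = weights.get(tok, 0.0) + lr * y * cnt
                bias += lr * y
    return weights, bias

def train_one_vs_rest(titles, labels):
    """Train one binary model per category (+1 for that category, -1 for the rest)."""
    samples = [featurize(t) for t in titles]
    models = {}
    for cls in sorted(set(labels)):
        targets = [1 if lab == cls else -1 for lab in labels]
        models[cls] = train_binary(samples, targets)
    return models

def predict(models, title):
    """Pick the category whose binary model gives the highest decision value."""
    features = featurize(title)
    return max(models, key=lambda cls: score(*models[cls], features))

# Toy, made-up training data; real training sets are far larger.
titles = [
    "wireless bluetooth headphones",
    "usb wireless charger",
    "cotton crew neck t-shirt",
    "slim fit cotton jeans",
    "stainless steel frying pan",
    "ceramic steel kitchen knife",
]
labels = ["Electronics", "Electronics", "Clothing", "Clothing", "Kitchen", "Kitchen"]

models = train_one_vs_rest(titles, labels)
prediction = predict(models, "Sony wireless bluetooth headphones black")
print(prediction)
```

One-vs-rest keeps each binary problem independent, so with thousands of leaf categories the per-category models can be trained separately. In practice a library solver (for example scikit-learn's `LinearSVC`) would replace the hand-written training loop.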


Information Technology

Data Management


According to the Indix blog: "Businesses want to know what products and services are available on the market, who else is selling those products, for how much, in which stores, with what success. They want to know what customers are saying about products and services. They would like to know how products and services are being promoted and when. Also news that could potentially affect the availability, price or popularity of products would be valuable in planning. Much of this information is public. But that public data is massive and chaotic." Indix compiles product data from the internet to create a comprehensive catalogue, or what it calls "the world’s first Product Information Marketplace".



From Indix's blog: "Almost every single piece of information available on a product page can be used as a signal to identify product classification: title, images, site breadcrumbs, descriptions, attributes". The training datasets consist of labelled, pre-processed data on the order of "a few millions". In total, the Indix catalogue holds over 1.5 billion products.
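The source does not say how predictions from the different signals are combined into a final leaf category. As one plausible illustration only, the sketch below uses a weighted vote: each signal contributes per-category confidences, scaled by a per-signal weight. The signal weights, category names, confidence values and the `combine_predictions` function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical weights: how much each signal is trusted. Not from the source.
SIGNAL_WEIGHTS = {"title": 0.5, "breadcrumbs": 0.3, "description": 0.2}

def combine_predictions(signal_predictions, weights=SIGNAL_WEIGHTS):
    """Combine per-signal predictions into one leaf category by weighted vote.

    signal_predictions maps a signal name to {leaf_category: confidence}.
    Returns the winning category, or None if there is nothing to combine.
    """
    totals = defaultdict(float)
    for signal, scores in signal_predictions.items():
        weight = weights.get(signal, 0.0)  # unknown signals are ignored
        for category, confidence in scores.items():
            totals[category] += weight * confidence
    if not totals:
        return None
    # Highest total wins; ties are broken alphabetically for determinism.
    return min(totals, key=lambda cat: (-totals[cat], cat))

# Made-up confidences for a single product page.
example = {
    "title": {"Headphones": 0.7, "Speakers": 0.3},
    "breadcrumbs": {"Headphones": 0.9, "Speakers": 0.1},
    "description": {"Speakers": 0.6, "Headphones": 0.4},
}
best = combine_predictions(example)
print(best)
```

A weighted vote is only one option; a second-stage model trained on the per-signal outputs would be another way to reach a combined decision.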

