AI Case Study

Instapaper uses natural language processing to turn saved articles into clean text that is ready to read in its app

Instapaper uses Diffbot to extract content from web articles and newspapers, and saves the associated metadata so that it syncs automatically with the app. Diffbot uses NLP and machine learning to extract content from web pages.

Industry

Consumer Goods And Services

Media And Publishing

Project Overview

"Instapaper uses Diffbot to extract data from articles and newspapers.

Diffbot is a developer of machine learning and computer vision algorithms and public APIs for extracting data from web pages / web scraping.

The Diffbot approach is unique in that it relies on computer-vision techniques (in conjunction with machine learning, NLP, DOM inspection) as the primary engine in identifying the proper content to extract from a page. What this means: when analyzing a web document, our system renders the page fully, as it appears in a browser -- including images, CSS, even Ajax-delivered content.

This full rendering allows Diffbot to break a page down into its constituent visual components. Then using machine-learning-trained algorithms (trained against tens of thousands of marked-up, rendered pages), these elements will be weighted for their likelihood in being various components of a page: title, author, related image, full text, sharing icons, next-page link, etc. As part of this the content and markup within and surrounding each element will also be evaluated to help further identify the right elements. Within each of these, again our machine-learning-trained algorithms will be used to identify likelihood of each block based on element content, surrounding markup, etc.

Finally the unrelated components will be discarded and the identified elements will be processed (extraneous text or inline elements removed; HTML normalized; date normalized; image-headers scanned; etc.) and concatenated into our JSON response format.

Our visual approach allows Diffbot to work very well on non-English pages, since visual structure of a page tends to be similar regardless of a page's written language."
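The pipeline described in the quote (break the rendered page into visual blocks, score each block per component type, discard unrelated blocks, normalize and assemble the rest into JSON) can be illustrated with a small sketch. This is not Diffbot's implementation or API: the `Block` fields, the hand-written weights in `score`, and the `KEEP` label set are all hypothetical stand-ins for the rendered features and trained models the quote mentions.

```python
import json
from dataclasses import dataclass

# Hypothetical set of component labels worth keeping in the output.
KEEP = {"title", "author", "text"}


@dataclass
class Block:
    """One visual component of a rendered page (hypothetical structure)."""
    content: str
    font_size: float     # rendered font size in pixels
    y: float             # vertical position, 0 = top of the page
    link_density: float  # share of the block's characters inside links


def score(block: Block) -> dict:
    """Toy stand-in for a trained model: hand-written weights per label."""
    words = len(block.content.split())
    return {
        "title": block.font_size / 10 - block.y / 500 - words / 20,
        "author": 2.0 if block.content.lower().startswith("by ") else -2.0,
        "text": words / 10 - 5 * block.link_density,
        "sharing": 3 * block.link_density - words / 10,
    }


def classify(block: Block) -> str:
    """Pick the label with the highest score for this block."""
    scores = score(block)
    return max(scores, key=scores.get)


def extract(blocks: list) -> dict:
    """Drop unrelated blocks, normalize whitespace, and build the result."""
    result = {}
    for block in blocks:
        label = classify(block)
        if label not in KEEP:
            continue  # e.g. sharing icons are discarded
        cleaned = " ".join(block.content.split())
        if label == "text":
            # Body paragraphs are concatenated in page order.
            result["text"] = (result.get("text", "") + "\n" + cleaned).strip()
        else:
            result.setdefault(label, cleaned)
    return result


# Made-up sample page, for illustration only.
SAMPLE = [
    Block("How  Extraction Works", 32, 40, 0.0),
    Block("By Jane Doe", 12, 90, 0.0),
    Block("Share Tweet Email", 12, 120, 1.0),
    Block("The page is rendered fully, as it appears in a browser, and then "
          "broken down into its visual components before each one is scored.",
          14, 200, 0.05),
]

if __name__ == "__main__":
    print(json.dumps(extract(SAMPLE), indent=2))
```

A real system would replace `score` with models trained on marked-up rendered pages, as the quote describes. The sketch only shows how per-block scores lead to a keep-or-discard decision and a JSON result.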

Reported Results

According to the company:

* Precision score of 0.968
* F1 score of 0.971
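The source reports precision and F1 but not recall. Assuming the standard definition of F1 as the harmonic mean of precision and recall, the implied recall can be worked out from the two reported figures. The function name below is hypothetical, and the result is a derived estimate, not a figure reported by the company.

```python
def recall_from_precision_and_f1(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, giving R = F1 * P / (2P - F1)."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall exists for these inputs")
    return f1 * precision / denominator


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    recall = recall_from_precision_and_f1(0.968, 0.971)
    print(f"implied recall: {recall:.3f}")
    # Round-trip check: the derived recall reproduces the reported F1.
    print(f"F1 check: {f1_score(0.968, recall):.3f}")
```

With the reported values, this gives an implied recall of roughly 0.974.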

Technology

Function

Digital Data

Digital Data Management

Background

Instapaper is a bookmarking service owned by Pinterest. It allows web content to be saved so it can be "read later" on a different device, such as an e-reader, smartphone, or tablet.

Benefits

Data

"Automatic data extraction from articles, products, discussions and more."