AI Case Study
Instapaper uses natural language processing to parse saved articles and generate clean text ready to be read in its app
Instapaper uses Diffbot to extract content from web articles and newspapers and to save metadata that auto-syncs with the app. Diffbot uses NLP and machine learning to extract content from web pages.
Industry
Consumer Goods And Services
Media And Publishing
Project Overview
"Instapaper uses Diffbot to extract data from articles and newspapers.
Diffbot is a developer of machine learning and computer vision algorithms and public APIs for extracting data from web pages / web scraping.
The Diffbot approach is unique in that it relies on computer-vision techniques (in conjunction with machine learning, NLP, DOM inspection) as the primary engine in identifying the proper content to extract from a page. What this means: when analyzing a web document, our system renders the page fully, as it appears in a browser -- including images, CSS, even Ajax-delivered content.
This full rendering allows Diffbot to break a page down into its constituent visual components. Then using machine-learning-trained algorithms (trained against tens of thousands of marked-up, rendered pages), these elements will be weighted for their likelihood in being various components of a page: title, author, related image, full text, sharing icons, next-page link, etc. As part of this the content and markup within and surrounding each element will also be evaluated to help further identify the right elements. Within each of these, again our machine-learning-trained algorithms will be used to identify likelihood of each block based on element content, surrounding markup, etc.
Finally the unrelated components will be discarded and the identified elements will be processed (extraneous text or inline elements removed; HTML normalized; date normalized; image-headers scanned; etc.) and concatenated into our JSON response format.
Our visual approach allows Diffbot to work very well on non-English pages, since visual structure of a page tends to be similar regardless of a page's written language."
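The flow described in the quote (score each rendered block for each page component, keep the best candidates, discard the rest, normalize, and emit JSON) can be illustrated with a toy sketch. This is not Diffbot's implementation: the `Block` features, the hand-picked `WEIGHTS` (standing in for trained models), the date formats, and the sample page are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Block:
    """One rendered visual component of a page (hypothetical feature set)."""
    text: str
    font_size: float     # rendered font size in pixels
    y: float             # distance from the top of the page in pixels
    link_density: float  # share of characters inside links, 0..1


# Hand-picked weights standing in for a trained model (assumption).
WEIGHTS = {
    "title": {"font_size": 0.08, "y": -0.002, "length": -0.001, "link_density": -2.0},
    "text": {"font_size": 0.01, "y": -0.0005, "length": 0.004, "link_density": -3.0},
}


def score(block, label):
    """Linear score for how likely a block is to be the given component."""
    w = WEIGHTS[label]
    return (w["font_size"] * block.font_size
            + w["y"] * block.y
            + w["length"] * len(block.text)
            + w["link_density"] * block.link_density)


def normalize_date(raw):
    """Return an ISO date string, or None if the text is not a known date format."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def extract(blocks):
    """Keep the best-scoring block per component; everything else is discarded."""
    result = {}
    for label in WEIGHTS:
        best = max(blocks, key=lambda b: score(b, label))
        result[label] = " ".join(best.text.split())  # collapse stray whitespace
    for block in blocks:
        date = normalize_date(block.text)
        if date is not None:
            result["date"] = date
            break
    return result


# Made-up page: navigation bar, headline, date, body text, sharing icons.
SAMPLE_BLOCKS = [
    Block("Home | World | Sport | Login", 12, 10, 1.0),
    Block("Robots  Learn to Read the Web", 32, 80, 0.0),
    Block("3/5/2014", 11, 130, 0.0),
    Block("Page structure looks similar across languages. " * 5, 16, 200, 0.02),
    Block("Share on Twitter Share on Facebook", 12, 900, 1.0),
]

if __name__ == "__main__":
    print(json.dumps(extract(SAMPLE_BLOCKS), indent=2, sort_keys=True))
```

Because the score uses only layout features such as font size, position, and link density, the same weights apply regardless of the language of the text, which mirrors the point the quote makes about non-English pages.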
Reported Results
According to the company:
* Precision score of 0.968
* F1 score of 0.971
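The two reported figures are related: F1 is the harmonic mean of precision and recall, so the recall they imply can be recovered. The sketch below does that arithmetic. Recall is not stated in the source, so the derived value is only what the two numbers imply, assuming both were measured on the same evaluation set.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def implied_recall(precision, f1):
    """Solve F1 = 2PR / (P + R) for R."""
    return f1 * precision / (2 * precision - f1)


# Reported values from the list above.
recall = implied_recall(precision=0.968, f1=0.971)

if __name__ == "__main__":
    print(f"implied recall: {recall:.3f}")
```

An implied recall slightly above the precision is consistent with an F1 score that sits between the two.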
Technology
Function
Digital Data
Digital Data Management
Background
Instapaper is a bookmarking service owned by Pinterest. It allows web content to be saved so it can be "read later" on a different device, such as an e-reader, smartphone, or tablet.
Benefits
Data
"Automatic data extraction from articles, products, discussions and more."