AI Case Study

Fast.AI more accurately classifies text while requiring less training data due to a new natural language processing technique

Fast.AI has introduced a novel natural language processing (NLP) method that reduces the amount of training data needed to classify text effectively, while also increasing classification accuracy.

Industry

Technology

Software And IT Services

Project Overview

According to fast.ai: "Our goal was to address these two problems: a) deal with NLP problems where we don’t have masses of data and computational resources, and b) make NLP classification easier... we found that if we carefully control how fast our model learns and update the pre-trained model so that it does not forget what it has previously learned, the model can adapt a lot better to a new dataset".
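One way to picture "carefully control how fast our model learns" is the slanted triangular learning rate schedule described in the ULMFiT paper: a short linear warm-up followed by a long linear decay. The following is a minimal sketch of that formula in plain Python, not fast.ai's implementation. The function name `slanted_triangular_lr` is hypothetical, and the defaults (`lr_max=0.01`, `cut_frac=0.1`, `ratio=32`) are the values the paper reports using.

```python
import math


def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at training step t (0-based) out of total_steps.

    cut_frac: fraction of steps spent increasing the rate.
    ratio:    how much smaller the lowest rate is than lr_max.
    """
    # Step at which the rate peaks; guarded so very short runs do not divide by zero.
    cut = max(1, math.floor(total_steps * cut_frac))
    if t < cut:
        p = t / cut  # linear warm-up from 0 to 1
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # linear decay back to 0
    return lr_max * (1 + p * (ratio - 1)) / ratio


# Example: inspect the schedule at the start, the peak, and the end of 1,000 steps.
for step in (0, 100, 1000):
    print(step, slanted_triangular_lr(step, total_steps=1000))
```

The rate starts at `lr_max / ratio`, peaks at `lr_max` after the first 10% of steps, and returns to `lr_max / ratio` by the final step.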

From the arXiv research paper: "We propose Universal Language Model Fine-tuning (ULMFiT), which pretrains a language model (LM) on a large general-domain corpus and fine-tunes it on the target task using novel techniques. The method is universal in the sense that it meets these practical criteria: 1) It works across tasks varying in document size, number, and label type; 2) it uses a single architecture and training process; 3) it requires no custom feature engineering or preprocessing; and 4) it does not require additional in-domain documents or labels".
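Two of the fine-tuning techniques the ULMFiT paper describes are discriminative fine-tuning (lower layers get smaller learning rates, each a factor of 2.6 below the layer above) and gradual unfreezing (the top layer is trained first, and one more layer is unfrozen per epoch). The sketch below only computes those schedules; it does not train a model, and the helper names `discriminative_lrs` and `gradual_unfreezing` are hypothetical rather than part of any library.

```python
def discriminative_lrs(top_lr, n_layers, decay=2.6):
    """Per-layer learning rates, ordered from the lowest layer to the top layer.

    Each layer's rate is the rate of the layer above it divided by `decay`.
    """
    return [top_lr / decay ** (n_layers - 1 - i) for i in range(n_layers)]


def gradual_unfreezing(n_layers):
    """Yield, epoch by epoch, the indices of the layers that are trainable.

    Epoch 1 trains only the top layer; each later epoch unfreezes one more
    layer below it until every layer is trainable.
    """
    for epoch in range(1, n_layers + 1):
        yield list(range(n_layers - epoch, n_layers))


# Example with a hypothetical 3-layer model and a top-layer rate of 0.01.
print(discriminative_lrs(0.01, 3))
for epoch, trainable in enumerate(gradual_unfreezing(3), start=1):
    print(f"epoch {epoch}: trainable layers {trainable}")
```

Both schedules serve the goal stated in the fast.ai quote: general knowledge stored in the lower layers changes slowly, so the pre-trained model is less likely to forget it while the upper layers adapt to the new task.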

Reported Results

The method as described in the arXiv paper "significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100x more data."
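The 18-24% figure is a relative reduction in error rate, not a drop in percentage points. A minimal sketch of the arithmetic, using made-up error rates chosen only for illustration (they are not results from the paper):

```python
def relative_error_reduction(baseline_error, new_error):
    """Fraction by which new_error improves on baseline_error."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical example: a baseline error rate of 10.0% lowered to 8.0%.
reduction = relative_error_reduction(10.0, 8.0)
print(f"{reduction:.0%}")  # prints "20%"
```

In this hypothetical case the error falls by only 2 percentage points, yet the relative reduction is 20%, which is the kind of figure the paper's claim refers to.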

Technology

Natural language processing using transfer learning on an AWD-LSTM language model (supervised and semi-supervised learning)

Function

R&D

Core Research And Development

Background

From fast.ai: "NLP is used in a wide variety of applications, such as search, personal assistants, summarization, etc. Overall, NLP is challenging as the strict rules we use when writing computer code are a poor fit for the nuance and flexibility of language. A common feature of successful NLP tasks is that large amounts of labeled data are available for training a model. However, until now such applications were limited to those institutions that were able to collect and label huge datasets and had the computational resources to process them on a cluster of computers for a long time. One particular area that is still challenging with deep learning for NLP, curiously enough, is the exact area where it’s been most successful in computer vision: classification. This refers to any problem where your goal is to categorize things (such as images, or documents) into groups (such as images of cats vs dogs, or reviews that are positive vs negative, and so forth)."

Benefits

Data

Training set of 100 labeled examples and about 50,000 unlabeled examples; WikiText-103 dataset, which contains a large pre-processed subset of English Wikipedia. From the arXiv paper: "For sentiment analysis, we evaluate our approach on the binary movie review IMDb dataset and on the binary and five-class version of the Yelp review dataset. For question classification [we use the] six-class version of the small TREC dataset of open-domain, fact-based questions divided into broad semantic categories. For topic classification, we evaluate on the large-scale AG news and DBpedia ontology datasets."