AI Case Study

Microsoft Research predicts underperforming biotech companies with 62% accuracy in natural language processing tests

Microsoft has conducted research on using NLP to ascertain from public company documents whether biotech stocks would under- or over-perform in the short term, achieving 62% accuracy in predicting underperformers.



Software and IT Services

Project Overview

"The goal was to use select text narrative sections from publicly available earnings release documents to predict and alert their analysts to investment opportunities and risks. For this project, we sought to prototype a predictive model to render consistent judgments on a company’s future prospects, based on the written textual sections of public earnings releases extracted from 10k releases and actual stock market performance. We leveraged natural language processing (NLP) pre-processing and deep learning against this source text. We modeled our prototype on just one industry, the biotechnology industry, which had the most abundant within-industry sample. Our project goal was to discern whether we could outperform chance accuracy of 33.33%."
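The 33.33% chance baseline implies a three-way prediction target (under-perform, neutral, over-perform) derived from each stock's four-week price move. As a minimal sketch, such a labeling rule could look like the following; the ±5% cut-off and the function name are illustrative assumptions, not figures from the write-up.

```python
def label_four_week_return(price_at_release, price_four_weeks_later,
                           threshold=0.05):
    """Bucket a stock's four-week move into one of three classes.

    The +/-5% threshold is an assumption for illustration; the write-up
    does not state the cut-offs that were actually used.
    """
    ret = (price_four_weeks_later - price_at_release) / price_at_release
    if ret <= -threshold:
        return "underperform"
    if ret >= threshold:
        return "overperform"
    return "neutral"
```

With three roughly balanced classes, always guessing one class yields the stated 33.33% chance accuracy, which is the baseline the 62% result is measured against.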

Reported Results

The model performed best at predicting under-performing biotech companies, achieving 62% accuracy (chance would be 33.3%).


NLP; a "deep learning model using a one-dimensional convolutional neural network"; Azure Machine Learning Workbench with a Python framework and a Theano backend.
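To illustrate the one-dimensional convolution at the core of such a model, the NumPy sketch below slides a stack of filters over a sequence of word vectors, then applies ReLU and global max pooling. The filter shapes and the pooling step are typical text-CNN choices assumed for illustration, not details confirmed by the write-up.

```python
import numpy as np

def conv1d_over_tokens(embedded, filters, bias):
    """Valid 1-D convolution over a sequence of word vectors.

    embedded: (seq_len, embed_dim) matrix of word embeddings
    filters:  (n_filters, window, embed_dim) convolution kernels
    bias:     (n_filters,) per-filter bias

    Returns an (n_filters,) vector: each filter's strongest response
    across the sequence, after ReLU and global max pooling.
    """
    n_filters, window, _ = filters.shape
    seq_len = embedded.shape[0]
    out = np.empty((seq_len - window + 1, n_filters))
    for t in range(seq_len - window + 1):
        patch = embedded[t:t + window]  # (window, embed_dim) slice of tokens
        # Dot each filter against the patch over both window and embed axes.
        out[t] = np.tensordot(filters, patch, axes=([1, 2], [0, 1])) + bias
    # ReLU, then keep each filter's maximum activation over time.
    return np.maximum(out, 0.0).max(axis=0)
```

In a full model these pooled filter responses would feed a dense softmax layer over the three performance classes; in practice this is what a framework layer such as a 1-D convolution in Keras computes in vectorized form.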


R&D

Core Research and Development


"When reviewing investment decisions, a firm needs to utilize all possible information, starting with publicly available documents like 10-K reports. However, reviewing public earnings release documents is time-intensive and the resulting analysis can be subjective. Moreover, the written sections of an earnings release require the most review time and are often the most subjective to interpretation. A thorough analysis of the investment opportunity of a business would also include a review of other companies in the industry to understand relative performance."



Fewer than 35,000 individual text document samples: "text corpus of two years of earnings release information for thousands of public companies worldwide...stock price of each of the companies on the day of the earnings release and the stock price four weeks later."
To create the word embeddings, a "GloVe pre-trained model of all of Wikipedia’s 2014 data, a six billion token, 400,000-word vocabulary vector model" was used.
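Pre-trained GloVe vectors are distributed as plain text, one token per line followed by its space-separated vector components. A minimal parser for that format, assuming any iterable of such lines (the real model ships as files like glove.6B.100d.txt), could look like:

```python
import numpy as np

def parse_glove_lines(lines):
    """Parse GloVe's plain-text format into a word -> vector dict.

    Each line holds a token followed by its vector components,
    separated by single spaces, e.g. "the 0.1 0.2 0.3".
    """
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors
```

The resulting lookup table is typically used to initialize the embedding layer of the downstream network, so that each token in an earnings release maps to its pre-trained vector.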