AI Case Study

Columbia Business School researchers use natural language processing to show that unusual language in news stories can forecast future market stress

Columbia Business School researchers applied natural language processing techniques to news stories about corporations, assessing how unusual the language is in order to determine whether investors use this information and whether the market adjusts accordingly. They found that the unusualness and sentiment of stories are correlated with subsequent market stress, but that the effect appears with a time lag.


Consumer Goods And Services

Media And Publishing

Project Overview

The researchers use NLP techniques to classify the content of Thomson Reuters news articles about large financial companies as "unusual" and then determine whether unusualness is related to company-level and aggregate market volatility.

"As a test of the economic impact of these effects, we modify a simple S&P 500 put selling strategy to take into account the information contained in our aggregate news measures.
To compare our macro results with micro results, we estimate a panel VAR for the corresponding company-specific variables — implied and realized volatility, positive and negative sentiment measures, and news measures interacted with unusualness. Impulse response functions again show that a shock to unusual negative (positive) news produces a statistically significant increase (decrease) in both implied and realized volatility."

Reported Results

The research found "unusual negative and positive news forecast volatility at both the company-specific and aggregate level. News shocks are impounded into volatility over the course of several months. This is a much longer time horizon than previous studies – which have focused on returns rather than volatility – have documented. The pattern of responses we find indicates that news is not absorbed by the market instantaneously, and the macro component of company-specific news is absorbed more slowly than the micro component. We argue that this type of response is consistent with investors who face constraints on the rate at which they can process information and who process micro information more easily than macro information."


"From a methodological perspective, our work applies two ideas from the field of natural language processing to text analysis in finance. As already noted, we measure the “unusualness” of language, and we do this through a measure of entropy in word counts. Also, we take consecutive strings of words (called n-grams) rather than individual words as our basic unit of analysis. In particular, we calculate the unusualness (entropy) of consecutive four-word sequences."


R And D

Core Research And Development


"Can the content of news articles forecast market stress and, if so, what type of content is predictive? Several studies have documented that news sentiment forecasts market returns. Research in finance and economics has commonly measured sentiment through what is known in the natural language processing literature as a bag-of-words approach: an article is classified as having positive or negative sentiment based on the frequency of positive or negative connotation words that it contains.... this approach misses important information: the unusualness of the first phrase lies not in its use of “collapse” or “Lehman” but in their juxtaposition. We therefore measure unusualness of consecutive word phrases rather than individual words."



The data set comprises over 360,000 articles about the top 50 global banks, insurers, and real estate firms by U.S. dollar market capitalization, published by Thomson Reuters from 1996 to 2014.

"Our market data comes from Bloomberg L.P. For each of the 50 companies in our sample we construct a U.S. dollar total returns series using Bloomberg price change and dividend yield data. Also, for those firms that have traded options, we use 30-day implied volatilities for at-the-money options from the Bloomberg volatility surfaces. The volatility data start in January 2005. Our single name volatility series are 20-day realized volatilities of local currency returns, as calculated by Bloomberg. Our macro data series are the Chicago Board Options Exchange Volatility Index (VIX) and 20-day realized volatility for the S&P 500 Index computed by Bloomberg from daily returns. Our S&P 500 level analysis starts in April 1998 (the first month when we have entropy measures) and our single-name analysis starts in January 2005 (the first month in which we have implied volatility data for firms)."