AI Case Study

Cambridge researchers develop a system to analyse cancer research papers to discover previously unexplored molecular biology links

Researchers at the University of Cambridge have introduced LION LBD, a literature-based discovery (LBD) system that identifies indirect associations between entities in the published cancer research literature. The system applies natural language processing and machine learning, including a convolutional neural network that classifies sentences by cancer hallmark, to an annotated database of publications and proposes candidate relations in response to users' searches. The authors report that at least a third of the suggested candidates are likely to be of potential interest to users.



Pharmaceuticals And Biotech

Project Overview

LION LBD is "a literature-based discovery system that enables researchers to navigate published information and supports hypothesis generation and testing. The system is built with a particular focus on the molecular biology of cancer using state-of-the-art machine learning and natural language processing methods, including named entity recognition and grounding to domain ontologies covering a wide range of entity types and a novel approach to detecting references to the hallmarks of cancer in text. LION LBD implements a broad selection of co-occurrence based metrics for analyzing the strength of entity associations, and its design allows real-time search to discover indirect associations between entities in a database of tens of millions of publications while preserving the ability of users to explore each mention in its original context in the literature. Although machine learning methods trained on manually annotated resources are well established as outperforming other approaches in the recognition of biomedical entity mentions in text, there has been very limited application of these technologies in LBD systems to date."
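To make the idea of co-occurrence based association metrics concrete, the following is a minimal sketch in Python. The source does not say which metrics LION LBD implements, so the two shown here (the Jaccard index and pointwise mutual information) are only common examples of this family. The toy documents, entity names and function names are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, over documents, how often each entity and each entity pair occurs."""
    entity_counts = Counter()
    pair_counts = Counter()
    for entities in docs:
        unique = sorted(set(entities))  # count each entity once per document
        entity_counts.update(unique)
        pair_counts.update(combinations(unique, 2))  # pairs in sorted order
    return entity_counts, pair_counts


def jaccard(a, b, entity_counts, pair_counts):
    """Documents mentioning both entities divided by documents mentioning either."""
    both = pair_counts[tuple(sorted((a, b)))]
    either = entity_counts[a] + entity_counts[b] - both
    return both / either if either else 0.0


def pmi(a, b, entity_counts, pair_counts, n_docs):
    """Pointwise mutual information (base 2) of two entities co-occurring."""
    both = pair_counts[tuple(sorted((a, b)))]
    if both == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log2(both * n_docs / (entity_counts[a] * entity_counts[b]))


# Hypothetical toy corpus: each inner list holds the entities found in one abstract.
docs = [
    ["TP53", "apoptosis"],
    ["TP53", "apoptosis", "MDM2"],
    ["MDM2", "ubiquitination"],
    ["TP53", "MDM2"],
]
entity_counts, pair_counts = cooccurrence_counts(docs)

print(jaccard("TP53", "apoptosis", entity_counts, pair_counts))
print(pmi("TP53", "apoptosis", entity_counts, pair_counts, len(docs)))
```

Both scores rise when two entities appear together more often than their individual frequencies would suggest, which is the general property that lets such metrics rank candidate associations.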

Reported Results

"The overall results of this analysis indicate that at least a third of the candidates suggested by the system are likely to be of potential interest to users, a result we consider very positive in the challenging LBD task."


"To further account for cancer-related processes, we apply a dedicated machine learning system to categorize each sentence in the dataset according to the hallmarks of cancer (HoC) taxonomy of Baker et al. (2016, 2017), a 37-category hierarchical extension of the well-established cancer hallmarks of Hanahan and Weinberg (2000, 2011). The system classifies each sentence into zero or more of the 37 hallmark categories using a convolutional neural network."

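Assigning "zero or more" categories to a sentence is a multi-label decision, which can be sketched as follows. This is not the authors' network: it shows only a plausible final step, assuming the classifier emits one raw score per category and that each score is passed through a sigmoid and compared with a threshold of 0.5. The scores, the threshold and the function names are assumptions, and the three category names are a small illustrative subset rather than the 37-category taxonomy.

```python
import math


def sigmoid(x):
    """Map a raw score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(scores, threshold=0.5):
    """Return every category whose sigmoid score reaches the threshold.

    Each category is decided independently, so the result may be empty
    or may contain several categories.
    """
    return [label for label, z in scores.items() if sigmoid(z) >= threshold]


# Hypothetical raw scores for two sentences.
sentence_a = {
    "resisting cell death": 2.1,
    "inducing angiogenesis": -3.0,
    "sustaining proliferative signaling": 0.4,
}
sentence_b = {
    "resisting cell death": -1.5,
    "inducing angiogenesis": -2.2,
    "sustaining proliferative signaling": -0.7,
}

print(predict_labels(sentence_a))  # two categories pass the threshold
print(predict_labels(sentence_b))  # no category passes, so the list is empty
```

Because each category is thresholded separately rather than competing in a single softmax, a sentence that mentions no hallmark simply receives an empty label set.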

R And D

Core Research And Development


"The enormous size and exponential growth of the scientific literature make it increasingly difficult for researchers to stay up to date on all developments in their field, let alone on those in related areas of study. This issue is particularly challenging in complex and tightly interconnected areas of biomedical research such as cancer, which is addressed in millions of existing publications. In the last two decades, there have been extensive efforts to address these challenges through the application of machine learning, natural language processing (NLP) and text mining methods to automate the processing of the biomedical scientific literature." LBD "seeks to uncover undiscovered public knowledge by connecting pieces of information from disjoint literatures. The key idea behind the original LBD formulation is that concepts that are never explicitly associated in the literature may be implicitly linked through intermediate concepts in disconnected subsets of that literature."

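The core LBD idea described above, that two concepts never mentioned together may still be linked through intermediate concepts, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the LION LBD implementation: concepts count as associated if they share a document, and the concept names, documents and function names are hypothetical placeholders.

```python
from collections import defaultdict


def build_graph(docs):
    """Link every pair of concepts that appear in the same document."""
    neighbours = defaultdict(set)
    for concepts in docs:
        unique = set(concepts)
        for concept in unique:
            neighbours[concept] |= unique - {concept}
    return neighbours


def open_discovery(start, neighbours):
    """Find concepts linked to `start` only through intermediate concepts.

    Returns a dict mapping each candidate concept to the set of
    intermediates that connect it to `start`.
    """
    direct = neighbours.get(start, set())
    candidates = defaultdict(set)
    for intermediate in direct:
        for target in neighbours[intermediate]:
            # Skip the start concept and anything already directly associated.
            if target != start and target not in direct:
                candidates[target].add(intermediate)
    return dict(candidates)


# Hypothetical toy literature: drug_X and disease_Z never share a document.
docs = [
    ["drug_X", "pathway_Y"],
    ["pathway_Y", "disease_Z"],
    ["drug_X", "gene_W"],
    ["gene_W", "disease_Z"],
    ["drug_X", "side_effect_V"],
]
graph = build_graph(docs)
print(open_discovery("drug_X", graph))
```

In this toy data the only candidate is disease_Z, reached through both pathway_Y and gene_W. A real system would additionally rank such candidates, for example by the strength of each intermediate association.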


"The literature that the current release of the LION LBD system builds on is retrieved from PubMed and covers all of the nearly 27 million citations (titles and abstracts) in PubMed at the time of data import. We draw our annotations for physical biomedical entities, mutations and diseases from PubTator, an online annotation resource building on state-of-the-art methods for named entity recognition and grounding."