AI Case Study

University of Glasgow researchers predict virus reservoir hosts with 83.5% accuracy and provide hypotheses about unknown viral vectors

Researchers at the Institute of Biodiversity, Animal Health and Comparative Medicine at the University of Glasgow used supervised learning to predict virus reservoir hosts with up to 83.5% accuracy. They then used the trained models to generate hypotheses about the likely hosts and vectors of viruses whose reservoirs are unknown. This research lays groundwork for further discovery and provides early triggers for surveillance.

Industry

Healthcare

Pharmaceuticals And Biotech

Project Overview

The researchers found gradient boosting machines (GBMs) to be the most effective classifier for predicting the hosts associated with a virus and for identifying the genomic traits most informative about viral ecology. They "created a machine learning framework that leverages traits from individual viruses with network-derived information from their relatives to predict: (i) the reservoir hosts of 12 key groups of RNA viruses, (ii) whether their transmission involves an arthropod vector, and (iii) the identity of that vector."

After training their models, the researchers used them "to predict the natural epidemiology of viruses with previously unknown hosts (hereafter “orphan” viruses). As expected from the accuracy of our models on viruses with known hosts, model-projected reservoirs and vectors often matched those suspected from epidemiological investigations. For example, we predicted an artiodactyl reservoir for human enteric coronavirus 4408, a suspected spillover infection from cows into humans... For viruses without conjectured reservoirs or vectors, we generate candidates for prioritized surveillance. For example, Bas-Congo virus caused an outbreak of hemorrhagic fever in the Democratic Republic of the Congo and was detected in humans only (18). Our models predicted an artiodactyl reservoir, a high probability of arthropod-borne transmission, and midges as the likely vector of this emerging disease."

Reported Results

Gradient boosting machines proved to be the most effective classifier. According to the researchers, "combining selected genomic traits (SelGen) with viral PNs [phylogenetic neighborhoods] predicted reservoir hosts with up to 83.5% accuracy, distinguishing all 11 reservoir groups, including taxonomic divisions within the birds (i.e., Neoaves versus Galloanserae) and bats." The researchers also used their methodology to propose candidate vectors or reservoirs for viruses that have none identified, which can guide discovery and help prioritise surveillance for early detection.
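This case study does not itemise the "genomic traits" the models consume. As a purely illustrative sketch, the Python below computes one commonly used kind of genomic bias, the ratio of observed to expected dinucleotide frequency, from a single sequence. The choice of dinucleotide bias, the function name `dinucleotide_bias`, and the toy sequence are assumptions made for illustration, not details taken from the research.

```python
from collections import Counter
from itertools import product
import math

def dinucleotide_bias(sequence):
    """Return observed/expected frequency for each of the 16 dinucleotides.

    Hypothetical helper: the expected frequency of a pair is the product of
    the frequencies of its two bases in the same sequence.
    """
    seq = sequence.upper().replace("T", "U")  # treat DNA-style input as RNA
    n = len(seq)
    base_freq = {base: seq.count(base) / n for base in "ACGU"}
    # Count overlapping pairs: positions (0,1), (1,2), ...
    pair_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    n_pairs = n - 1
    bias = {}
    for a, b in product("ACGU", repeat=2):
        expected = base_freq[a] * base_freq[b]
        observed = pair_counts[a + b] / n_pairs
        bias[a + b] = observed / expected if expected > 0 else math.nan
    return bias

# Toy sequence, not a real viral genome.
bias = dinucleotide_bias("ACGUACGUACGU")
for pair in ("AC", "UA", "AA"):
    print(pair, round(bias[pair], 3))
```

Each genome then becomes a fixed-length vector of 16 numbers, which is the kind of tabular input a classifier such as a GBM can work with regardless of genome length.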

Technology

The researchers used "supervised machine learning, a class of statistical models that can integrate multiple traits that carry a weak signal in isolation but build a strong signal when optimally weighted (12). Gradient boosting machines (GBMs) (13) outperformed seven alternative classifiers in predicting host associations from viral genomic biases and identified the most informative genomic traits for each aspect of viral ecology... We trained two additional sets of models that focused on arthropod-borne transmission (6). The first nearly perfectly identified which viruses were transmitted by arthropod vectors. Combined GBMs were most accurate overall (bagged accuracy = 97.0%). Only 5 out of 527 viruses were misclassified by all three GBMs (PN, SelGen, and combined), potentially reflecting uncertainty in some currently accepted transmission routes (supplementary text). The second set of models distinguished transmission by all four vector classes (bagged accuracy = 90.8%). Ranking traits according to their predictive power showed that midge and sandfly vectors were identified predominately from genomic biases, whereas mosquito and tick vectors were strongly correlated with viral phylogeny."
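To make the idea of weak signals combining into a strong one concrete, here is a minimal sketch of gradient boosting for a two-class problem, written in plain Python with one-split decision stumps. It is not the researchers' implementation: the synthetic data, the binary labels, and every name and parameter (`make_data`, `fit_gbm`, `n_rounds`, `learning_rate`) are assumptions chosen for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_data(n, n_traits, shift, rng):
    """Synthetic data: each trait is only weakly shifted by the class label."""
    X, y = [], []
    for i in range(n):
        label = i % 2  # balanced classes
        centre = shift if label == 1 else -shift
        X.append([rng.gauss(centre, 1.0) for _ in range(n_traits)])
        y.append(label)
    return X, y

def fit_stump(X, residuals):
    """Find the single trait/threshold split that best fits the residuals."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, lm, rm)
    return best[1:]  # (trait index, threshold, left value, right value)

def stump_predict(stump, row):
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value

def fit_gbm(X, y, n_rounds=30, learning_rate=1.0):
    p = sum(y) / len(y)
    base = math.log(p / (1 - p))  # initial score: log-odds of class 1
    scores = [base] * len(X)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of the log loss with respect to the score.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_predict(stump, row)
                  for s, row in zip(scores, X)]
    return base, stumps, learning_rate

def predict(model, row):
    base, stumps, lr = model
    score = base + lr * sum(stump_predict(s, row) for s in stumps)
    return 1 if sigmoid(score) >= 0.5 else 0

def accuracy(model, X, y):
    return sum(predict(model, row) == yi for row, yi in zip(X, y)) / len(y)

rng = random.Random(0)
X_train, y_train = make_data(120, 5, 0.6, rng)
X_test, y_test = make_data(200, 5, 0.6, rng)

model = fit_gbm(X_train, y_train)
first_stump_only = (model[0], model[1][:1], model[2])
test_accuracy = accuracy(model, X_test, y_test)

print("first stump only:", accuracy(first_stump_only, X_test, y_test))
print("boosted ensemble:", test_accuracy)
```

Each round fits a stump to the errors the current ensemble still makes, so later stumps draw on traits that earlier ones ignored. A real analysis with many classes, such as the 11 reservoir groups, would more likely rely on an established GBM library than on hand-written code like this.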

Function

R And D

Core Research And Development

Background

"Preventing emerging viral infections—including Ebola, SARS, and Zika—requires identi- fication of reservoir hosts and/or blood- feeding arthropod vectors that perpetuate viruses in nature. Current practice requires combining evidence from field surveillance, phylogenetics, laboratory experiments, and real- world interventions but is time consuming and often inconclusive. This creates prolonged periods of uncertainty that may amplify economic and health losses."

Benefits

Data

"We collected a single representative genome sequence per viral species or strain from 12 taxonomic groups (11 families and one order) of ssRNA viruses that can infect humans; that is, 80% of all human-infective groups. For each virus, we used extensive literature searches to determine currently accepted reservoir hosts (437 viruses, 11 reservoir groups), whether transmission involves an arthropod vector (527 viruses), and if so, the identity of arthropod vectors (98 viruses, four vector groups). To maximize predictive scope, reservoir and vector groups included the most- frequent sources of emerging human viruses as well as other common hosts in human-infective viral families (e.g., fish, plants, and insects)".