AI Case Study

Researchers from Tampere University apply machine learning to Instagram posts to predict the spread of influenza in Finland

Researchers from Tampere University investigate using Instagram posts to predict the spread of influenza-like symptoms in Finland. Using image recognition to automatically classify picture posts that may contain flu-related content, they compare several machine learning algorithms and find that a form of gradient tree boosting (XGBoost) is the most successful at predicting the prevalence of flu.

Industry

Public And Social Sector

Education And Academia

Project Overview

The researchers' goal was "to assess the predictive power of an alternative data source, Instagram. By using 317 weeks of publicly available data from Instagram, we trained several machine learning algorithms to both nowcast and forecast the number of official influenza-like illness incidents in Finland where population-wide official statistics about the weekly incidents are available. In addition to date and hashtag count features of online posts, we were able to utilize also the visual content of the posted images with the help of deep convolutional neural networks."

Reported Results

"Forecasting models for predicting 1 week and 2 weeks ahead showed statistical significance as well by reaching correlation coefficients of 0.903 and 0.862, respectively. This study demonstrates how social media and in particular, digital photographs shared in them, can be a valuable source of information for the field of infodemiology.

Overall, we show that Instagram can be considered as a significant source of information for Internet-based monitoring and forecasting of influenza epidemics. Furthermore, we show that the visual content of the posted images can also be utilized as input features with the help of a deep convolutional neural network, increasing the prediction performance. A mean absolute error of 11.33 incidents per week and Pearson’s correlation of 0.963 were achieved with XGBoost algorithm when several modalities of Instagram posts (date, count, image) have been used as an input for nowcasting the official influenza-like illness counts in Finland."
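The two headline metrics quoted above, mean absolute error and Pearson's correlation, can be illustrated with a minimal sketch. This is not the researchers' evaluation code: the function names are hypothetical, and the sample numbers are invented purely to show the arithmetic.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute gap between observed and predicted weekly counts."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def pearson_correlation(actual, predicted):
    """Pearson's r: covariance scaled by the spread of both series."""
    n = len(actual)
    if n != len(predicted) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_a == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_a * var_p)


# Invented toy values for illustration only (not data from the study).
official_ili = [10, 20, 30, 40]   # observed weekly incident counts
model_output = [12, 18, 33, 39]   # hypothetical model predictions

print(mean_absolute_error(official_ili, model_output))   # 2.0
print(pearson_correlation(official_ili, model_output))
```

Read this way, the reported mean absolute error of 11.33 means the nowcast was off by about 11 incidents per week on average, while the correlation of 0.963 describes how closely the predicted series tracked the rises and falls of the official one.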

Technology

"We selected 4 reference images that contributed to the definition of 4 image features. The images were collected using Google Images search engine and all of them are released under public domain enabling full permission for usage as is. The images were searched using terms 'boxes of drugs/medicine', 'boxes of drugs/medicine and pills', 'mint, ginger and lemons' and 'ginger and lemon tea', respectively... In order to count the weekly number of images similar to reference images on Instagram, we employed a pretrained deep convolutional neural network (CNN) model, i.e., Inception-ResNet-v2 [59]. The model is 164 layers deep and has been pretrained on the well known ImageNet dataset.

We trained 9 different machine learning algorithms for nowcasting the official weekly ILI counts in Finland. These algorithms include linear regression (also known as ordinary least squares), ridge regression, elastic net, LASSO, k-nearest neighbor regression, support vector machine, random forest, AdaBoost and XGBoost. For each algorithm, we used date and count features for modeling."
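The image-feature step described in the excerpt (counting, per week, the posted images that resemble a reference image) might look roughly like the sketch below. The excerpt does not say how similarity was measured, so cosine similarity with a fixed threshold is an assumption here, as are the function names and the threshold value. The short vectors are toy stand-ins for the embeddings a pretrained CNN such as Inception-ResNet-v2 would produce.

```python
import math
from collections import Counter
from datetime import date, timedelta


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def week_start(day):
    """Monday of the week containing `day`."""
    return day - timedelta(days=day.weekday())


def weekly_similar_image_counts(posts, reference_embedding, threshold=0.8):
    """Count, per week, the posts whose image embedding is close to the reference.

    `posts` is an iterable of (posting_date, embedding) pairs.
    The threshold is an assumed value, not one taken from the study.
    """
    counts = Counter()
    for posted_on, embedding in posts:
        if cosine_similarity(embedding, reference_embedding) >= threshold:
            counts[week_start(posted_on)] += 1
    return dict(counts)


# Toy 3-dimensional vectors standing in for real CNN embeddings.
reference = [1.0, 0.0, 0.0]
posts = [
    (date(2018, 1, 1), [0.9, 0.1, 0.0]),  # Monday, similar
    (date(2018, 1, 3), [0.0, 1.0, 0.0]),  # same week, dissimilar
    (date(2018, 1, 7), [1.0, 0.0, 0.1]),  # Sunday of the same week, similar
    (date(2018, 1, 8), [0.8, 0.2, 0.0]),  # following Monday, similar
]
counts = weekly_similar_image_counts(posts, reference)
print(counts)
```

Running one such count per reference image would yield four weekly series, which is one plausible reading of the "4 image features" mentioned in the excerpt.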

Function

Strategy

Data Science

Background

"Conventional surveillance systems for monitoring infectious diseases, such as influenza, face challenges due to shortage of skilled healthcare professionals, remoteness of communities and absence of communication infrastructures. Internet-based approaches for surveillance are appealing logistically as well as economically. Influenza epidemics possess certain easily identifiable characteristics, which have allowed their identification throughout history. These characteristics include immense attack rates and explosive spread of the disease."

Benefits

Data

"317 weeks of publicly available data from Instagram", including images, hashtags, and dates. "For this study, weekly ILI incidents reported by public primary healthcare register in Finland between the dates 30 April 2012 and 27 May 2018 (in total of 317 weeks) were used. The data is publicly available and accessible [57].

Instagram data: We identified 7 keywords in Finnish language to be searched from the hashtags of the Instagram posts, namely cough, fever, flu, influenza, muscle ache, sick, throat ache. These keywords correspond to the most common symptoms of ILI and we hypothesized that they would be often used in social media posts associated with ILI. We collected publicly available Instagram posts containing at least one of these hashtags between the dates 30 April 2012 and 27 May 2018 (in total of 317 weeks).

We used weekly data from 30 April 2012 to 22 May 2017 (265 weeks) as the training data, i.e., hyper-parameter optimization and model comparison. In order to report the performance of the trained models, data from one year was used as the test (hold-out) data, i.e., weekly data from 29 May 2017 to 27 May 2018 (52 weeks)."
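The weekly hashtag counts and the chronological train/test split described above can be sketched as follows. This is an illustration under stated assumptions rather than the researchers' pipeline: it assumes each week is labeled by its Monday (which makes the quoted 265 and 52 week totals add up), the helper names are hypothetical, and the English hashtags are placeholders because the excerpt does not list the Finnish terms.

```python
from collections import Counter
from datetime import date, timedelta

FIRST_WEEK = date(2012, 4, 30)       # a Monday; start of the study period
LAST_TRAIN_WEEK = date(2017, 5, 22)  # assumed to label the final training week
NUM_WEEKS = 317                      # total weeks reported in the excerpt


def week_start(day):
    """Monday of the week containing `day`."""
    return day - timedelta(days=day.weekday())


def weekly_post_counts(posts, keywords):
    """Count, per week, the posts carrying at least one keyword hashtag.

    `posts` is an iterable of (posting_date, list_of_hashtags) pairs.
    """
    wanted = {k.lower() for k in keywords}
    counts = Counter()
    for posted_on, hashtags in posts:
        if wanted & {h.lower() for h in hashtags}:
            counts[week_start(posted_on)] += 1
    return counts


def chronological_split(weeks, last_train_week):
    """Earlier weeks train the models; later weeks are held out for testing."""
    train = [w for w in weeks if w <= last_train_week]
    test = [w for w in weeks if w > last_train_week]
    return train, test


weeks = [FIRST_WEEK + timedelta(weeks=i) for i in range(NUM_WEEKS)]
train_weeks, test_weeks = chronological_split(weeks, LAST_TRAIN_WEEK)
print(len(train_weeks), len(test_weeks))  # 265 52

# Placeholder English hashtags and invented posts, for illustration only.
keywords = ["flu", "fever"]
posts = [
    (date(2018, 1, 2), ["flu", "winter"]),
    (date(2018, 1, 5), ["Fever"]),
    (date(2018, 1, 6), ["holiday"]),      # no keyword, so not counted
    (date(2018, 1, 9), ["flu", "fever"]),  # counted once, in the next week
]
counts = weekly_post_counts(posts, keywords)
```

Splitting by date rather than at random keeps every test week strictly later than every training week, so the hold-out year mimics predicting a season the models have not seen.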