
AI Case Study

Researchers identify healthcare knowledge gaps in Africa using search query data

Rediet Abebe of Cornell University, working with researchers at Microsoft Research, the Rockefeller Foundation and Stony Brook University, has used a generative topic model to analyse Bing search data from African countries. The research project uses latent Dirichlet allocation (LDA), which automatically extracts topics from web searches, to analyse gaps in the African population's knowledge of diseases such as HIV/AIDS, malaria, and tuberculosis. The findings create a foundation for effective education and for guiding discussion on health policy.



Healthcare Providers And Services

Project Overview

"A cofounder of Black in AI who grew up in Ethiopia, [Rediet] Abebe is passionate about combining AI and data to help marginalized communities. Her work has grown from a 2016 project at Microsoft Research to explore the health needs of people in Africa.

Using topic models and natural language processing, Abebe combed through 18 months of Bing search results for all 54 nations on the African continent to assess queries related to HIV/AIDS, malaria, and tuberculosis. Automation then created categories based on subject matter. The total number of queries included in the paper was not disclosed.

Facebook AI researchers also used artificial intelligence for public health and aid organizations to create population density maps of Africa.

Results were published last year in a paper coauthored by Shawndra Hill and Jennifer Vaughan of Microsoft Research, along with Peter Small and Andrew Schwartz of Stony Brook University. The paper was recently accepted for publication by the International AAAI Conference on Web and Social Media, scheduled to take place in June in Munich, Germany.

The AI also categorizes words and topics most associated with specific diseases. For example, women were more interested in questions related to pregnancy or breastfeeding, while men were more interested in news stories about people who say they’ve been cured of HIV.

Search results demonstrated that women and users aged 18-24 are more concerned about stigma than other groups, and natural cure searches were highest in the 35-49 age group. Cure myths that often appear in search results include the prayers of Nigerian prophets, moringa seed oil, and garlic.

The results also highlight such questions as: “I’m HIV positive, can my boss fire me?” and “What are my legal protections?” Or “What ways can you mitigate stigma in social settings?”

Annotators with graduate level experience were then invited to examine topics like natural cures, symptoms, stigma, and drugs to assess the objectivity, accuracy, and relevance of results.

“What we found was that for searches related to natural cures and remedies, people were getting web pages that have serious issues with accuracy, effectiveness, and relevance,” Abebe said.

A correlation was found between the rate of stigma-related searches and high rates of HIV.

People with medical experience were asked to participate in this portion of the work in order to compensate for the lack of relevant health experience among AI researchers.

‘Grain of salt’ data

Abebe likes the use of search query results because, unlike a survey that asks pointed questions, search results are open-ended and can provide insights into people’s concerns and their lived experiences.

As for any data derived from the internet, there are a number of caveats, such as the fact that the majority of searches are in English; internet connection rates are rising fast, but there are still sizable portions of African nations that lack web access; and people self-identified by using the name of a disease in their searches.

The study also makes no attempt to follow subsequent searches to trace the evolution of search patterns.

So while Abebe shares results with public health officials in countries like Ethiopia, Ghana, Nigeria, and South Africa, she cautions that the results should never replace manually collected ground truth data and that any attempt to do so could be dangerous.

“It’s really more a guideline than it is ground truth that you can really rely on,” she said.

Attempts to create predictive systems based on search query results could also have an adverse impact on public health.

When sharing the results of the study with health officials in Africa, Abebe said, it became clear that some officials are well aware of the prevalence of results claiming Nigerian prophets, moringa seed oil, or garlic can cure HIV, but they were less informed about the questions related to discrimination or stigma."

Reported Results

"Combined, our results suggest that search data can help illuminate health information needs in Africa and inform discussions on health policy and targeted education efforts both on- and offline." (paper)


"We uncover themes in which individuals are interested using latent Dirichlet allocation (LDA), a standard generative model for automatically extracting topics from text." (paper)

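To make the LDA step described in the quote above more concrete, here is a minimal sketch of topic extraction from short search queries. It is not the authors' implementation: the source does not say which software or inference method was used, so this sketch assumes collapsed Gibbs sampling, one common way to fit LDA. The function names (fit_lda, top_words), the hyperparameter values, and the toy queries are all invented for illustration and are not data from the study.

```python
import random

# Invented toy queries for illustration only; NOT data from the study.
TOY_QUERIES = [q.split() for q in [
    "hiv symptoms fever rash",
    "hiv symptoms early signs",
    "malaria symptoms fever chills",
    "hiv natural cure garlic",
    "hiv cure moringa oil",
    "natural cure garlic moringa",
]]


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA with collapsed Gibbs sampling (assumed method, see lead-in).

    docs is a list of token lists. Returns (vocab, phi, doc_topic), where
    phi[k][v] is the estimated probability of word v under topic k and
    doc_topic[d][k] counts the tokens of document d assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Resample its topic in proportion to
                # (topic share in this document) * (word share in the topic).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    # Smoothed topic-word distributions; each row sums to 1.
    phi = [
        [(topic_word[k][v] + beta) / (topic_total[k] + vocab_size * beta)
         for v in range(vocab_size)]
        for k in range(num_topics)
    ]
    return vocab, phi, doc_topic


def top_words(vocab, phi, n):
    """Return the n most probable words for each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in phi
    ]


if __name__ == "__main__":
    vocab, phi, _ = fit_lda(TOY_QUERIES, num_topics=2)
    for k, words in enumerate(top_words(vocab, phi, 4)):
        print(f"topic {k}: {', '.join(words)}")
```

Each topic comes out as a ranked list of words, which a person then has to read and label (for example "symptoms" or "natural cures"). The model only groups words that tend to appear together; it does not name the themes or judge the accuracy of what people find.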


"The lack of comprehensive, high-quality health data in developing nations creates a roadblock for combating the impacts of disease. One key challenge is understanding the health information needs of people in these nations. Without understanding people's everyday needs, concerns, and misconceptions, health organizations and policymakers lack the ability to effectively target education and programming efforts." (paper)



"Bing searches related to HIV/AIDS, malaria, and tuberculosis from all 54 African nations" (paper)
