AI Case Study

Primer AI improves coverage of women of science using natural language processing

Primer AI's Quicksilver system was trained on 30,000 English Wikipedia articles about scientists and over 3 million sentences from news documents describing them and their work. It was then fed the names and affiliations of 200,000 authors of scientific papers. It found 40,000 people who are missing from Wikipedia even though their news coverage is distributed similarly to that of scientists who do have articles. It also found that women were underrepresented, and it has since been used in three English Wikipedia editathons to improve coverage of women scientists.

Industry

Consumer Goods And Services

Media And Publishing

Project Overview

"We are publicly releasing free-licensed data about scientists that we’ve been generating along the way, starting with 30,000 computer scientists. Only 15% of them are known to Wikipedia. The data set includes 1 million news sentences that quote or describe the scientists, metadata for the source articles, a mapping to their published work in the Semantic Scholar Open Research Corpus, and mappings to their Wikipedia and Wikidata entries. We will revise and add to that data as we go.

We trained Quicksilver’s models on 30,000 English Wikipedia articles about scientists, their Wikidata entries, and over 3 million sentences from news documents describing them and their work. Then we fed in the names and affiliations of 200,000 authors of scientific papers.

In the morning we found 40,000 people missing from Wikipedia who have a similar distribution of news coverage as those who do have articles. Quicksilver doubled the number of scientists potentially eligible for a Wikipedia article overnight.

It also revealed the second flavor of the recall problem that plagues human-generated knowledge bases: information decay. For most of those 30,000 scientists who are on English Wikipedia, Quicksilver identified relevant information that was missing from their articles.

Creating an article for a person is only the start. It must be maintained forever, updated as the world changes. The vast majority of information on Wikipedia is known to be correct and well cited, even after more than a decade of stunts and studies to prove otherwise. But as Fetahu et al. showed last year, Wikipedia lags significantly behind news about people and events.

Take for example Ana Mari Cauce, the president of the University of Washington. Her Wikipedia article went stale last year. Quicksilver discovers more recent information about Cauce’s defense of DACA students and her ongoing role in the free speech vs. hate speech battles on US campuses.

The Quicksilver output for Aleksandr Kogan is also enlightening. The psychologist’s English Wikipedia article was created in March 2018 when the Cambridge Analytica scandal blew up around him. But the trail goes cold on 26 April, the date of his article’s most recent reference. Quicksilver identifies an event for Kogan just four days later, the revelation that he had his hands on Twitter data as well. Twitter is now cracking down on data access.

Automatically generating Wikipedia-style articles is right at the edge of what is currently possible in natural language processing. It is usually framed as a multi-document summarization task: Given a set of reference documents that contain information about an entity, generate a summary of the entity."
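To make the multi-document summarization framing concrete, here is a minimal sketch of a naive extractive baseline. It is not Quicksilver's method, which the source does not describe at this level of detail. The function name `summarize`, the word-frequency scoring, and the sample documents are all assumptions made for illustration.

```python
import re
from collections import Counter


def summarize(entity, documents, max_sentences=2):
    """Pick the sentences that mention `entity`, ranked by average word frequency.

    Returns (document_index, sentence) pairs so each sentence stays tied
    to the reference document it came from.
    """
    # Collect every sentence that mentions the entity, with its document index.
    candidates = []
    for doc_id, text in enumerate(documents):
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if entity.lower() in sentence.lower():
                candidates.append((doc_id, sentence))

    # Count how often each word occurs across the candidate sentences.
    counts = Counter(
        word
        for _, sentence in candidates
        for word in re.findall(r"[a-z]+", sentence.lower())
    )

    def score(item):
        words = re.findall(r"[a-z]+", item[1].lower())
        return sum(counts[w] for w in words) / max(len(words), 1)

    # Highest-scoring sentences first; ties keep their original order.
    return sorted(candidates, key=score, reverse=True)[:max_sentences]


if __name__ == "__main__":
    # Hypothetical reference documents about a made-up scientist.
    docs = [
        "Jane Doe leads a robotics lab. The weather was mild that day.",
        "Critics praised the robotics lab that Jane Doe founded in 2015.",
    ]
    for doc_id, sentence in summarize("Jane Doe", docs):
        print(doc_id, sentence)
```

A frequency baseline like this only selects existing sentences. It does not generate new text or resolve conflicting facts, which is part of why the task sits at the edge of what is currently possible.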

Reported Results

Quicksilver "found 40,000 people missing from Wikipedia who have a similar distribution of news coverage as those who do have articles" and "was used in three English Wikipedia editathons for improving coverage of women of science."

Technology

"For Quicksilver’s architecture we started on the trail blazed by the Google AI team, but our goal is more practical. Rather than using Wikipedia as an academic testbed for summarization algorithms, we’re building a system that can be used for building and maintaining knowledge bases such as Wikipedia. We need to track data provenance so that any statement in the final text output can be referenced to its source. We also need structural data about entities and their relations so that we track changes of fact, not just text. And to achieve high precision, there just aren’t enough biographies with enough clean source document mappings to train today’s seq2seq models. They can’t learn the tacit knowledge required. What’s needed is a knowledge base coupled with a seq2seq model."
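The two requirements named in the quote above, data provenance and tracking changes of fact rather than text, can be illustrated with a small sketch. This is a hedged illustration only, not Primer AI's implementation. The `Source`, `Statement`, and `KnowledgeBase` names, the return labels, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Source:
    url: str        # where the supporting sentence was found
    published: str  # ISO date of the source document


@dataclass
class Statement:
    subject: str
    predicate: str
    value: str
    sources: list = field(default_factory=list)


class KnowledgeBase:
    """Stores one current value per (subject, predicate) with its sources."""

    def __init__(self):
        self._facts = {}

    def assert_fact(self, subject, predicate, value, source):
        """Record a fact and report whether it was added, confirmed, or changed."""
        key = (subject, predicate)
        current = self._facts.get(key)
        if current is not None and current.value == value:
            # Same fact seen again: keep the extra source as further provenance.
            current.sources.append(source)
            return "confirmed"
        # New fact or a change of fact: replace the value and its sources.
        self._facts[key] = Statement(subject, predicate, value, [source])
        return "added" if current is None else "changed"

    def cite(self, subject, predicate):
        """Return the source URLs backing the current value."""
        return [s.url for s in self._facts[(subject, predicate)].sources]


if __name__ == "__main__":
    kb = KnowledgeBase()
    # Hypothetical sources and facts, for illustration only.
    first = Source("https://example.org/news/1", "2018-04-26")
    second = Source("https://example.org/news/2", "2018-04-30")
    print(kb.assert_fact("Jane Doe", "employer", "Example University", first))
    print(kb.assert_fact("Jane Doe", "employer", "Example Institute", second))
    print(kb.cite("Jane Doe", "employer"))
```

Keying facts by subject and predicate means a new employer registers as a change of fact even when the surrounding text is worded completely differently, and every stored value can be traced back to at least one source.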

Function

Information Technology

Knowledge Management

Background

"Human-generated knowledge bases like Wikipedia have a recall problem. First, there are the articles that should be there but are entirely missing. The unknown unknowns.

Consider Joelle Pineau, the Canadian roboticist bringing scientific rigor to artificial intelligence and who directs Facebook’s new AI Research lab in Montreal. Or Miriam Adelson, an actively publishing addiction treatment researcher who happens to be a billionaire by marriage and a major funder of her own field. Or Evelyn Wang, the new head of MIT’s revered MechE department whose accomplishments include a device that generates drinkable water from sunlight and desert air. When I wrote this a few days ago, none of them had articles on English Wikipedia, though they should by any measure of notability."

Benefits

Data

"We trained Quicksilver’s models on 30,000 English Wikipedia articles about scientists, their Wikidata entries, and over 3 million sentences from news documents describing them and their work. Then we fed in the names and affiliations of 200,000 authors of scientific papers."