AI Case Study
Researchers detect restaurants with elevated risk of foodborne illness in real time using machine learning
Researchers at Harvard T.H. Chan School of Public Health, Google, Southern Nevada Health District, Chicago Department of Public Health, Chicago Department of Innovation and Technology, and Veterans Affairs Boston Healthcare System have developed the Foodborne IllNess DEtector in Real time (FINDER). The detector is a system aimed to work out the restaurants that are potentially dangerous to consumers for causing foodborne illnesses. With the use of machine learning, it analyses Google search and anonymous location logs to detect queries indicative of foodborne illness. It then matches those queries with the user's location data and identify a potentially dangerous restaurant. FINDER was deployed in Las Vegas and Chicago in 2016 and 2017 respectively. In Chicago, FINDER prompted 71 of the 5880 inspections that were carried out, of which 52.3% were deemed unsafe upon inspection, compared to 24.7% for all inspected restaurants not prompted by FINDER.
Industry
Public And Social Sector
Education And Academia
Project Overview
"In the 1800s, John Snow had to go door to door during an epidemic of cholera to uncover its mechanisms of spread.1 He recorded where people were getting their drinking water from in order to pinpoint the source of the outbreak. Here we scale up this approach using machine learning to detect potential sources of foodborne illness in real time.
Here we sought to test the efficacy of a machine-learned model that uses aggregated and anonymized Google search and location data to detect potential sources of foodborne illness in real time. Our primary goal was to use this model to identify restaurants with potentially unsafe health code violations that could contribute to foodborne illness spread, with the hypothesis that our model would be able to more accurately identify a restaurant with serious health code violations than systems currently in place. We find that such an approach can lead to a greater than threefold improvement in identifying potentially problematic venues over current approaches, including a 68% improvement over an advanced complaint-based system that already utilizes Twitter data mining. This model can be expanded by public health departments to reduce the burden of foodborne illness across the United States, and can also be expanded to assist in monitoring a variety of other diseases globally.
Here we introduce a machine-learned model called FINDER (Foodborne IllNess DEtector in Real time), which detects restaurants with elevated risk of foodborne illness in real time. The model leverages anonymous aggregated web search and location data and ensures that specific findings cannot be attributed to individual users. We call this approach machine-learned epidemiology. It complements existing approaches to identifying illnesses with new real-time signals available at large scale.
FINDER applies machine learning to Google search and location logs to infer which restaurants have major food safety violations, which may be causing foodborne illness. This anonymous and aggregated logs data comes from users who opted to share their location data, which already enables other applications, such as estimates of live traffic.
Our method first identifies queries indicative of foodborne illness, and then looks up restaurants visited in aggregate by the users who issued those queries, leveraging their anonymized location history. FINDER then calculates, for each applicable restaurant, the proportion of users who visited it and later showed evidence of foodborne illness in their searches. Notably, in most previous work, a user’s location is only known if she searched or posted a message from the location.11,12 In contrast, our data source is much more comprehensive, allowing us to reliably infer previously visited locations, regardless of whether the user took any action there.
The key challenge is the inherent noise and ambiguity of individual search queries. For example, the query [diarhea] could be related to food poisoning, but also contains a typo and does not convey information about the details of the symptom (e.g., what type of diarrhea, is it experienced by the user or her family member). We solve this challenge with a privacy-preserving supervised machine-learned classifier, which leverages a collection of signals beyond the query string itself, such as search results shown in response to the query, 13 aggregated clicks on those results, and the content of the opened web pages. The resulting classifier has high accuracy in identifying queries related to food poisoning, achieving area under the ROC curve of 0.85, and F1 score of 0.74 in evaluation with three independent medical doctors and separately with three non-medical professionals rating each query. Note that an individual affected by foodborne illness starts feeling symptoms with certain delay (incubation period) after the infection has occurred. While FINDER processes log data in real time, confident inference can only be drawn after incubation period has elapsed for an initial cohort of affected patrons.
FINDER was deployed in Las Vegas between May and August 2016; during that period a total of 5038 inspections were completed, 61 of which were prompted by FINDER (Table 1). A similar deployment occurred in Chicago between November 2016 and March 2017, where 5880 inspections were completed, 71 of which were prompted by FINDER. Of the inspections not attributed to FINDER, 1291 inspections were driven by complaints through the existing systems in Chicago.
We assessed the accuracy of FINDER’s predictions by comparing the fraction of unsafe restaurants it identified to the fraction of unsafe venues found in all the other restaurant inspections conducted during the experimentation period (BASELINE), as well as the fraction of unsafe venues found in the two subgroups, COMPLAINT and ROUTINE.
Of all the restaurants identified by FINDER, 52.3% were deemed unsafe upon inspection, compared to 24.7% for BASELINE restaurants (Table 2). We used binomial logistic regression to determine the odds ratio of being unsafe for restaurants in the FINDER and BASELINE groups. The former were 3.06 times (95% CI: 2.14–4.35) as likely to be unsafe as the latter, when accounting for restaurant risk level and city in our models (p < 0.001, Table 2). When stratified by restaurant risk level, FINDER restaurants were more likely to be designated unsafe across all risk designations, however the odds of being identified by FINDER as unsafe was higher in lower risk-level restaurants than in high risk-level restaurants (Table 2). Importantly, this suggests that a priori determination of the restaurant risk level might not necessarily reflect the true level of risk at the venue.
Reported Results
"Of all the restaurants identified by FINDER, 52.3% were deemed unsafe upon inspection, compared to 24.7% for BASELINE restaurants (Table 2). We used binomial logistic regression to determine the odds ratio of being unsafe for restaurants in the FINDER and BASELINE groups. The former were 3.06 times (95% CI: 2.14–4.35) as likely to be unsafe as the latter, when accounting for restaurant risk level and city in our models (p < 0.001, Table 2)."
Technology
Function
R And D
Core Research And Development
Background
"Machine learning has become an increasingly common artificial intelligence tool and can be particularly useful when applied to the growing field of syndromic surveillance. Frequently, syndromic surveillance depends upon patients actively reporting symptoms that may signal the presence of a specific disease.2,3 In recent years, syndromic surveillance has also begun to include passively collected information, such as information from social media, which can also lend insight into potential disease outbreaks.4,5,6 In this study, we use such observational data to identify instances of foodborne illness at scale."
Benefits
Data
"FINDER applies machine learning to Google search and location logs to infer which restaurants have major food safety violations, which may be causing foodborne illness. This anonymous and aggregated logs data comes from users who opted to share their location data, which already enables other applications, such as estimates of live traffic."