AI Case Study

Stanford University researchers improve weather forecasting by over 40% on historic data with machine learning

Researchers from Stanford University and other institutions developed a machine learning method for predicting temperature and precipitation in the western United States two to six weeks in advance. The system is an ensemble of two regression models, one built on multitask model selection and one on multitask k-nearest-neighbour features, and it improved forecasting skill over a traditional operational model by 40-169% for the years 2011-2018.


Public And Social Sector

Public Services

Project Overview

According to the researchers: "Our system is an ensemble of two regression models. The first integrates the diverse collection of meteorological measurements and dynamic model forecasts in the SubseasonalRodeo dataset and prunes irrelevant predictors using a customized multitask model selection procedure. The second uses only historical measurements of the target variable (temperature or precipitation) and introduces multitask nearest neighbor features into a weighted local linear regression. Each model alone is significantly more accurate than the debiased operational U.S. Climate Forecasting System (CFSv2), and our ensemble skill exceeds that of the top Rodeo competitor for each target variable and forecast horizon."
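The ensembling step described in the quote can be illustrated with a minimal sketch. The excerpt does not say how the two models' outputs are combined, so the equal-weight average below is an assumption, and the function name `ensemble_forecast` and the sample values are hypothetical.

```python
def ensemble_forecast(pred_a, pred_b):
    """Combine two models' per-location forecasts.

    Assumption: an equal-weight average. The quoted text only says the
    system is "an ensemble of two regression models" and does not give
    the combination rule.
    """
    if len(pred_a) != len(pred_b):
        raise ValueError("both forecasts must cover the same locations")
    return [(a + b) / 2.0 for a, b in zip(pred_a, pred_b)]


# Hypothetical temperature anomalies (degrees C) at three grid points.
model_one = [1.0, -0.5, 2.0]   # e.g. the model using many data sources
model_two = [3.0, 0.5, 1.0]    # e.g. the model using only the target's history
combined = ensemble_forecast(model_one, model_two)
print(combined)
```

Averaging is a common choice when two models make partly independent errors, which is plausible here because the two models draw on different inputs.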

Reported Results

This method "demonstrated 40-169% improvements in forecasting skill across the challenge period (2017-18) and the years 2011-18 more generally. Notably, the same procedures provide these improvements for each of the four Rodeo prediction tasks (forecasting temperature or precipitation at weeks 3-4 or weeks 5-6). In the short term, we anticipate that these improvements will benefit disaster management (e.g., anticipating droughts, floods, and other wet weather extremes) and the water management, development, and protection operations of the USBR more generally (e.g., providing irrigation water to 20% of western U.S. farmers and generating hydroelectricity for 3.5 million homes)."
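To make the "40-169% improvements in forecasting skill" figure concrete, the sketch below shows how a relative skill improvement could be computed. The excerpt does not define skill; the code assumes it is the cosine similarity between forecast and observed anomaly vectors across locations. All numbers are hypothetical and are not results from the study.

```python
import math


def skill(forecast_anoms, observed_anoms):
    """Cosine similarity between forecast and observed anomalies.

    Assumption: this is the skill measure; the excerpt does not define it.
    Returns 0.0 if either vector is all zeros.
    """
    dot = sum(f * o for f, o in zip(forecast_anoms, observed_anoms))
    norm_f = math.sqrt(sum(f * f for f in forecast_anoms))
    norm_o = math.sqrt(sum(o * o for o in observed_anoms))
    if norm_f == 0.0 or norm_o == 0.0:
        return 0.0
    return dot / (norm_f * norm_o)


def relative_improvement(new_skill, baseline_skill):
    """Percentage improvement of new_skill over baseline_skill."""
    if baseline_skill == 0.0:
        raise ValueError("baseline skill must be non-zero")
    return 100.0 * (new_skill - baseline_skill) / abs(baseline_skill)


# Hypothetical average skills for a baseline and an improved model.
baseline = 0.20
improved = 0.28
print(round(relative_improvement(improved, baseline), 1))  # 40.0
```

Under this reading, a 40% improvement means the new average skill is 1.4 times the baseline's, and a 169% improvement means it is 2.69 times the baseline's.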


From the research paper: "Our subseasonal ML system is an ensemble of two regression models: a local linear regression model with multitask model selection (MultiLLR) and a weighted local autoregression enhanced with multitask k-nearest neighbor features (AutoKNN). The MultiLLR model introduces candidate regressors from each data source in the SubseasonalRodeo dataset and then prunes irrelevant predictors using a multitask backward stepwise criterion designed for the forecasting skill objective. The AutoKNN model extracts features only from the target variable (temperature or precipitation), combining lagged measurements with a skill-specific form of nearest-neighbor modeling."
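The nearest-neighbor idea behind AutoKNN can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it works on a single series rather than jointly across locations, it assumes cosine similarity as the "skill-specific" similarity, and the function name `knn_feature` and the parameters `horizon`, `window`, and `k` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0.0 and nv > 0.0 else 0.0


def knn_feature(history, horizon, window, k):
    """Nearest-neighbor feature for the value `horizon` steps past the series end.

    The query is the last `window` observations. Each past date c is scored
    by how similar the window that ended `horizon` steps before c is to the
    query. The feature is the mean observed value on the k best-scoring dates.
    """
    last = len(history) - 1
    query = history[last - window + 1 : last + 1]
    scored = []
    for c in range(horizon + window - 1, last + 1):
        end = c - horizon  # last observation available when c was forecast
        candidate = history[end - window + 1 : end + 1]
        scored.append((cosine(query, candidate), c))
    if not scored:
        raise ValueError("history too short for this horizon and window")
    scored.sort(reverse=True)
    top = scored[:k]
    return sum(history[c] for _, c in top) / len(top)


# Hypothetical anomaly series with a repeating 4-step pattern.
series = [1.0, 2.0, -1.0, -2.0] * 5
feature = knn_feature(series, horizon=2, window=2, k=3)
print(feature)  # 2.0, the value the pattern takes two steps past the end
```

In the quoted description, features of this kind are combined with lagged measurements inside a weighted local autoregression; that regression step is omitted here.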


R And D

Core Research And Development


"Water managers in the western United States (U.S.) rely on longterm forecasts of temperature and precipitation to prepare for droughts and other wet weather extremes. To improve the accuracy of these longterm forecasts, the U.S. Bureau of Reclamation and the National Oceanic and Atmospheric Administration (NOAA) launched the Subseasonal Climate Forecast Rodeo, a year-long real-time forecasting challenge in which participants aimed to skillfully predict temperature and precipitation in the western U.S. two to four weeks and four to six weeks in advance."



The researchers built their own SubseasonalRodeo dataset from a variety of sources, including daily temperature measurements and precipitation amounts dating back to 1979, sea surface temperature and sea ice concentration, multivariate ENSO index data, Madden-Julian oscillation data, relative humidity and pressure values, daily mean geopotential height, and forecasts from the North American Multi-Model Ensemble.