
AI Case Study

DeepMind achieves superhuman performance with its Go-playing AI program using reinforcement learning and neural nets

DeepMind previously created an AI program to play the game of Go, which successfully beat the human world champion. Building on that, researchers improved the program so that it needs far less computing power, no human supervision, and no handcrafted features. This new iteration has outperformed the strongest of its earlier versions.

Industry

Technology

Software and IT Services

Project Overview

AlphaGo Zero is the latest iteration of DeepMind's Go-playing computer program, and advances the previous efforts in significant ways. "AlphaGo Zero discovered a remarkable level of Go knowledge during its self-play training process. This included not only fundamental elements of human Go knowledge, but also non-standard strategies beyond the scope of traditional Go knowledge. It learns from self-play reinforcement learning, starting from random initial weights, without using rollouts, with no human supervision and using only the raw board history as input features. It uses just a single machine in the Google Cloud with 4 TPUs. AlphaGo Zero is provided with perfect knowledge of the game rules. These are used during MCTS, to simulate the positions resulting from a sequence of moves, and to score any simulations that reach a terminal state. Using this approach, AlphaGo Zero defeated the strongest previous versions of AlphaGo, which were trained from human data using handcrafted features, by a large margin."
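The only domain knowledge given to the program is the rules of the game, and those rules enter only inside the tree search: they determine which moves are legal when a node is expanded, what position a sequence of moves produces, and the exact score when a simulation reaches the end of a game. The sketch below illustrates that separation under stated assumptions; GoState and policy_net/value_net are hypothetical placeholders, not DeepMind's implementation.

```python
class Node:
    """One node of the search tree over Go positions."""
    def __init__(self, state, prior=1.0):
        self.state = state          # a GoState position (hypothetical interface)
        self.prior = prior          # prior probability from the policy network
        self.children = {}          # move -> Node
        self.visits = 0
        self.value_sum = 0.0

def expand(node, policy_net):
    """Rules are used here: enumerate legal moves and the positions they lead to."""
    priors = policy_net(node.state)              # assumed: dict of move -> probability
    for move in node.state.legal_moves():        # rules: which moves are legal
        child_state = node.state.play(move)      # rules: resulting position
        node.children[move] = Node(child_state, priors.get(move, 1e-8))

def evaluate(node, value_net):
    """Rules are used here: exact result at terminal states, learned value otherwise."""
    if node.state.is_terminal():                 # rules: recognise the end of the game
        return 1.0 if node.state.winner() == node.state.to_play() else -1.0
    return value_net(node.state)                 # learned estimate for non-terminal leaves
```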

Reported Results

AlphaGo Zero "achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo... Our results comprehensively demonstrate that a pure reinforcement learning approach is fully feasible, even in the most challenging of domains".

Technology

"The neural network is trained by a self-­play reinforcement learning algorithm that uses MCTS [Monte Carlo tree search] to play each move. We applied our reinforcement learning pipeline to train our program AlphaGo Zero. Training started from completely random behaviour and continued without human intervention for approximately three days. We subsequently applied our reinforcement learning pipeline to a second instance of AlphaGo Zero using a larger neural network and over a longer duration. Training again started from completely random behaviour and continued for approximately 40 days. Over the course of training, 29 million games of self­-play were gener­ated. Parameters were updated from 3.1 million mini­-batches of 2,048 positions each. The neural network contained 40 residual blocks.

The AlphaGo Zero self-play algorithm can similarly be understood as an approximate policy iteration scheme in which MCTS is used for both policy improvement and policy evaluation. Policy improvement starts with a neural network policy, executes an MCTS based on that policy’s recommendations, and then projects the (much stronger) search policy back into the function space of the neural network. Policy evaluation is applied to the (much stronger) search policy: the outcomes of self-play games are also projected back into the function space of the neural network. These projection steps are achieved by training the neural network parameters to match the search probabilities and self-play game outcome respectively.
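Concretely, "projecting back into the function space of the network" means training against two targets: the value head is regressed toward the self-play outcome, and the policy head toward the MCTS visit probabilities, with the paper also adding an L2 weight penalty. A small NumPy sketch of such a combined loss for a single position; the names and the regularisation constant are illustrative assumptions:

```python
import numpy as np

def alphazero_style_loss(value_pred, outcome, policy_logits, search_probs,
                         params, c=1e-4):
    """Loss sketch: squared value error + policy cross-entropy + L2 regularisation.

    value_pred    : scalar network value v for the position
    outcome       : game result z from the current player's perspective (+1 / -1)
    policy_logits : raw network policy logits over the 19*19 + 1 possible moves
    search_probs  : MCTS visit-count distribution pi over the same moves
    params        : flattened network parameters (for the L2 term)
    """
    value_loss = (outcome - value_pred) ** 2
    shifted = policy_logits - np.max(policy_logits)              # numerically stable
    log_p = shifted - np.log(np.sum(np.exp(shifted)))            # log-softmax
    policy_loss = -np.dot(search_probs, log_p)                   # cross-entropy with pi
    reg = c * np.sum(params ** 2)
    return value_loss + policy_loss + reg
```

Minimising the cross-entropy term pulls the network policy toward the stronger search policy, while the squared-error term pulls the value estimate toward the observed game outcome, which is exactly the two projection steps described above.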

The algorithm was started with random initial parameters for the neural network. The neural network architecture is based on the current state of the art in image recognition and hyperparameters for training were chosen accordingly. The input to the neural network is a 19 × 19 × 17 image stack comprising 17 binary feature planes. MCTS search parameters were selected by Gaussian process optimization, so as to optimize self-play performance of AlphaGo Zero using a neural network trained in a preliminary run. For the larger run (40 blocks, 40 days), MCTS search parameters were re-optimized using the neural network trained in the smaller run (20 blocks, 3 days). The training algorithm was executed autonomously without human intervention."
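The 17 binary planes encode only raw board history: stone positions for each player over the most recent moves plus a constant plane indicating whose turn it is. A sketch of building such an input stack with NumPy, assuming board positions are stored as 19 × 19 arrays (+1 black, -1 white, 0 empty); the exact plane ordering here is an assumption for illustration:

```python
import numpy as np

def encode_position(board_history, to_play, size=19, history_len=8):
    """Build a 19 x 19 x 17 binary input stack: 8 planes of the current player's
    stones, 8 of the opponent's, and one constant colour-to-play plane.
    board_history is a list of size x size arrays, newest last; shorter
    histories are padded with empty boards."""
    padded = ([np.zeros((size, size))] * history_len + list(board_history))[-history_len:]
    planes = []
    for board in reversed(padded):                        # most recent position first
        planes.append((board == to_play).astype(np.float32))
        planes.append((board == -to_play).astype(np.float32))
    planes.append(np.full((size, size), 1.0 if to_play == 1 else 0.0, dtype=np.float32))
    return np.stack(planes, axis=-1)                      # shape (19, 19, 17)

# Example: an empty board with black (+1) to play
stack = encode_position([np.zeros((19, 19))], to_play=1)
assert stack.shape == (19, 19, 17)
```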

Function

R&D

Core Research and Development

Background

"Much progress towards artificial intelligence has been made using supervised learning systems that are trained to replicate the decisions of human experts. However, expert data sets are often expensive, unreliable or simply unavailable. Even when reliable data sets are available, they may impose a ceiling on the performance of systems trained in this manner. By contrast, reinforcement learning systems are trained from their own experience, in principle allowing them to exceed human capabilities, and to operate in domains where human expertise is lacking. Recently, there has been rapid progress towards this goal, using deep neural networks trained by reinforcement learning. These systems have outperformed humans in computer games, such as Atari, and 3D virtual environments. However, the most chal­lenging domains in terms of human intellect—such as the game of Go, widely viewed as a grand challenge for artificial intelligence—require a precise and sophisticated lookahead in vast search spaces."

Benefits

Data

The two datasets used for testing and validation are the KGS dataset and the GoKifu dataset.
