AI Case Study

UCL researchers optimise generated training set data for use in machine learning to decrease data requirements and improve model robustness

UCL researchers have developed an algorithm based on stochastic gradient descent that can generate data for training sets based on minimal input from real data. They find that optimal model robustness comes from a combination of real and generated data.


Public And Social Sector

Education And Academia

Project Overview

Researchers at UCL developed a generic algorithm called GENERE which "correctly identifies the correct data distribution, and hence leads to better generalization performances than what the model learns without the generator. To illustrate the benefit of using generative regularization, we considered a class of real world problems for which obtaining data is costly: learning to answer math exam problems. In this illustrative experiment, a simple text-to-equation translation problem is created, where inputs are sentences describing an equation such as 'compute one plus three minus two', and outputs are symbolic equations, such as 'X = 1 + 3 - 2'. The only publicly available word problem datasets we are aware of contain between 400 and 600 problems which is not enough to properly train sufficiently rich models that capture the link between the words and the quantities involved in the problem."

"We proposed to allow data generators to be 'weakly' specified, leaving the undetermined coefficients to be learned from data. We derived an efficient algorithm called GENERE, that jointly estimates the parameters of the model and the undetermined sampling coefficients, removing the need for costly cross-validation. While this procedure could be viewed as a generic way of building informative priors, it does not rely on a complex integration procedure such as Bayesian optimization, but corresponds to a simple modification of the standard stochastic optimization algorithms, where the sampling alternates between the use of real and generated data."

Reported Results

The research found that "mixing real and generated data improves performance significantly. When GENERE is used, the sampling is tuned to the problem at hand and give better generalization performance".


"The optimization algorithm was based on stochastic gradient descent using Adam as an adaptive step size scheme with mini-batches of size 32. A total of 256 epochs over the data was used in all the experiments."


R And D

Core Research And Development


"Many tasks require a large amount of background knowledge to be solved efficiently, but acquiring a huge amount of data is costly, both in terms of time and money." To get around these obstacles, data simulators can generate data for training, but have their own inherent limitations, such as exhibiting a strong model bias where the generated data differs from the observed.



712 math word problems for training from three datasets: 274 for development, and 571 for testing.

"As input to the encoder, we downloaded pre-trained 300-dimensional embeddings trained on Google News data using the word2vec software. The development data was used to tune these parameters before performing the evaluation on the test set. The optimal value for creating datasets was was 15% real data, 85% generated data, a fraction obtained on the development set."