AI Case Study

Adobe Research team investigates current approaches in machine learning for automating data cleansing and finds them inadequate

Researchers from Adobe Research investigate the ability of meta-learning techniques, including several metric learning methods, to automatically clean data. These approaches assume that datasets can be described by pre-defined features (metafeatures), and that cleaning methods will work similarly well for similarly described datasets. The researchers find that this is not the case, which implies that the standard metafeatures currently used to describe datasets are inadequate for automated data cleansing.

Industry

Technology

Software And IT Services

Project Overview

"Currently, metafeatures are chosen manually and a set of standard features are used in the meta-learning literature. These include simple counts such as numbers of samples & features and distributional statistics such as mean feature skew. When meta-learning components are employed in automated systems, it is assumed that these features retain a sufficient amount of information for distinguishing datasets in terms of which models are best suited for them. Our results suggest otherwise. Sequential model based optimization (SMBO) iteratively
updates an initial model seed (e.g., a data cleansing pipeline)
to improve downstream performance (e.g., classification error).
Using meta-learning to provide a model seed has been shown to significantly improve performance over randomly selected model seeds, however, we demonstrate this is likely an artifact of the meta-learning process and not a result of what meta-learning aims to promise."
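
To make the idea of a metafeature vector concrete, the following is a minimal sketch that computes three of the metafeatures named in the quote: number of samples, number of features, and mean feature skew. The function names, the toy dataset, and the choice of population skewness are illustrative assumptions, not the researchers' implementation, and the paper's full set of 22 metafeatures is not reproduced here.

```python
from statistics import mean


def skewness(values):
    """Population skewness (third standardized moment) of one feature column."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant column: skew treated as zero (an assumption)
    return m3 / m2 ** 1.5


def metafeatures(rows):
    """Return a small metafeature vector for a dataset given as a list of rows."""
    columns = list(zip(*rows))
    return {
        "n_samples": len(rows),
        "n_features": len(columns),
        "mean_feature_skew": mean(skewness(col) for col in columns),
    }


# Hypothetical toy dataset: 4 samples, 2 features, both skewed to the right.
toy = [
    [1.0, 10.0],
    [2.0, 10.0],
    [3.0, 10.0],
    [10.0, 40.0],
]
print(metafeatures(toy))
```

A vector like this is what a meta-learning system would compare across datasets when deciding which pipeline to try first.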

Reported Results

"Discovering a more accurate representation could help close a large performance gap that we illuminate in our experiments. We also test several metric learning methods, however, none is able to shrink this gap. We conclude that the metafeatures circulated by the community are inadequate for this task and learning a better metafeature representation would benefit automated data cleansing".

Technology

"the choice of distance metric is typically ignored. By exploring a number of distance metrics, we expect we might discover one that better represents the metafeature manifold than L1 distance. We evaluated M data cleansing pipelines on N datasets and computed a metafeature vector for each of the N datasets. We used 22 standard metafeatures from the literature. We considered 192 different possible data cleansing pipelines each constituting a sequence of preprocessing components with hyperparameters (mean/mode imputation, feature selection by percentile, L1/L2 feature normalization, over/under sampling, PCA) in conjunction with L2-regularized logistic regression. We were then able to rank the M data cleansing pipelines by classification accuracy on each of the N datasets. To be able to compare pipeline performance across datasets, we scaled pipeline scores for each dataset to [0,1] (0 being the score for the worst pipeline).

Using an appropriate distance metric, we expect that two nearby datasets as measured by the metric, would rank the data cleansing pipelines similarly. More importantly, they would most likely agree on the top performing pipeline so that similar datasets should be cleaned similarly. For instance, an optimal metric would ensure the performance of neighbor’s recommended pipelines decayed monotonically with increasing dataset distance."
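
The evaluation idea described above can be sketched in a few lines: scale each dataset's pipeline scores to [0,1], find a dataset's nearest neighbor by L1 distance between metafeature vectors, and check how well the neighbor's top pipeline does on the original dataset. All names, the three toy datasets, their scores, and the `shortfall` measure are hypothetical and chosen only for illustration. The sketch also assumes the metafeature values are already on comparable scales.

```python
def scale_scores(scores):
    """Min-max scale one dataset's pipeline scores to [0, 1] (worst pipeline -> 0)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # all pipelines tie; an arbitrary convention
    return [(s - lo) / (hi - lo) for s in scores]


def l1_distance(a, b):
    """L1 (Manhattan) distance between two metafeature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def best_pipeline(scaled):
    """Index of the top-scoring pipeline for one dataset."""
    return max(range(len(scaled)), key=lambda i: scaled[i])


def recommend(target, meta, scaled_scores):
    """Return the nearest neighbor of `target` and that neighbor's best pipeline."""
    neighbors = [name for name in meta if name != target]
    nearest = min(neighbors, key=lambda name: l1_distance(meta[target], meta[name]))
    return nearest, best_pipeline(scaled_scores[nearest])


# Hypothetical metafeature vectors and raw accuracies for 3 pipelines.
meta = {"A": [0.1, 0.9], "B": [0.2, 0.8], "C": [0.9, 0.1]}
accuracy = {
    "A": [0.70, 0.80, 0.75],
    "B": [0.60, 0.90, 0.65],
    "C": [0.95, 0.55, 0.75],
}
scaled = {name: scale_scores(scores) for name, scores in accuracy.items()}

neighbor, pipeline = recommend("A", meta, scaled)
# Shortfall: how far the recommended pipeline falls below A's own best (scaled to 1.0).
shortfall = 1.0 - scaled["A"][pipeline]
print(neighbor, pipeline, shortfall)
```

In this toy setup dataset A is well served by its neighbor B. Dataset C is not: its nearest neighbor is also B, whose best pipeline is the worst one for C. That second case is the kind of mismatch between metafeature distance and pipeline ranking that the quoted passage is concerned with.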

Function

R And D

Core Research And Development

Background

"Data preprocessing or cleansing is one of the biggest hurdles
in industry for developing successful machine learning applications. The process of data cleansing includes data imputation, feature normalization & selection, dimensionality reduction, and data balancing applications. Currently such preprocessing is manual. One approach for automating this process is meta-learning."
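
As a rough illustration of what a data cleansing pipeline is, the sketch below chains two of the steps named above, mean imputation and L2 normalization, in plain Python. The helper names, the use of `None` for missing values, and the choice to normalize each row are assumptions made for this example. It does not cover feature selection, dimensionality reduction, or data balancing.

```python
import math


def impute_mean(rows):
    """Replace missing values (None) with the mean of the observed values in that column."""
    columns = list(zip(*rows))
    means = []
    for col in columns:
        observed = [v for v in col if v is not None]
        # A column with no observed values falls back to 0.0 (an assumption).
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def normalize_l2(rows):
    """Scale each row to unit L2 norm; all-zero rows are left unchanged."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm for v in row] if norm > 0 else list(row))
    return out


def run_pipeline(rows, steps):
    """Apply each cleansing step in order and return the cleaned rows."""
    for step in steps:
        rows = step(rows)
    return rows


# Hypothetical raw data with two missing entries.
raw = [[3.0, None], [None, 4.0], [3.0, 4.0]]
clean = run_pipeline(raw, [impute_mean, normalize_l2])
print(clean)
```

Choosing which steps to include, in what order, and with which hyperparameters is the decision that meta-learning is meant to automate.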

Data

"We repeated this experiment for three different data sources (available on author’s website): one constructed by artificially creating 432 binary classification datasets using Scikit-Learn’s
make classification method, another using 816 binary classification tasks created from web activity of various companies which use Adobe’s digital marketing solutions, and a third consisting of 397 binary classification tasks downloaded from OpenML.org."