Scaling Machine Learning Data Repair Systems for Sparse Datasets

Date

2021-01-21

Authors

Attia, Omar

Publisher

University of Waterloo

Abstract

Machine learning data repair systems (e.g., HoloClean) have achieved state-of-the-art performance on the data repair problem for many datasets. However, these systems face significant challenges with sparse datasets. This work investigates the challenges that such datasets pose to machine learning data repair systems and presents dataset-independent methods to mitigate the effects of data sparseness. Finally, the methods are validated experimentally on a large, sparse real-world dataset, Census, showing that the problem size can be reduced by more than 70%, saving significant computational cost, while still achieving high-accuracy data repairs (94.5% accuracy).

Keywords

data cleaning, data imputation, machine learning, sparse data, structured data, data quality, data science
