Impact of data quality on ML models: Improving data quality with Outlier Detection

Date

2024-04-15

Authors

Sharma, Rakshit

Publisher

University of Waterloo

Abstract

In the dynamic landscape of Machine Learning (ML) applications, data quality is an important factor in the performance of ML models. This thesis presents a study that proposes methods for enhancing data quality through an iterative data recapture approach. The research focuses primarily on univariate time-series data from which specific patterns can be extracted. We start by discussing existing data capture methods, in which data is collected manually or with hardware devices. The proposed methods, namely the Sessionized Recapture Strategy (SRS) and the Robust Single Capture Method (RSCM), are described in detail and offer distinct strategies for iterative data recapture. The Single Capture Method (SCM) and the Recapture and Visualize Method (RVM) serve as the two baseline methods and are characterized by their data capture time and the resulting False Positive Rate (FPR). SRS is an enhancement of RVM, and RSCM is an enhancement of SCM. This thesis also introduces an outlier detection algorithm named Outlier detection through ParameterlEss Robust Algorithm (OPERA), which, when added to SCM and RVM, yields RSCM and SRS, respectively. Compared with the baseline methods, the proposed methods show promising results and improve the quality of the captured data. The experiments are performed on two datasets: one captured in the Embedded Systems Lab on one of the ANVIL products for Future Technology Devices International (FTDI) chips, and a publicly available Electrocardiogram (ECG) dataset provided by PhysioNet. The thesis concludes by synthesizing key findings and recommendations for practitioners seeking to optimize model performance through enhanced data quality.

Keywords

data capture verification, outlier detection, anomaly detection, parameterless, robust outlier detection, OPERA, data capture strategies, ANVIL, ECG5000, data capture issues
