Evaluating the Usefulness of Synthetic Data in Healthcare: Applications in Predictive Modeling and Privacy Protection

Date

2024-04-24

Authors

Basri, Mohammad Ahmed

Publisher

University of Waterloo

Abstract

Data-driven approaches in healthcare have opened new horizons for patient care, disease management, and medical research. A significant challenge, however, is the availability of large-scale, high-quality datasets: accessing health data that contains sensitive information requires lengthy approval processes and is subject to stringent restrictions. Synthetic data offers a viable solution to this problem by replicating the statistical properties of real datasets. Because of the privacy concerns and regulatory restrictions associated with health data, there is a growing need for highly realistic synthetic health data, particularly in health data science initiatives. Although significant progress has been made in establishing recognized evaluation methods for synthetic data models, there remains a notable gap in understanding how best to improve the quality and usefulness of synthetic data. This thesis aims to bridge this gap by systematically evaluating objective functions for hyperparameter tuning of synthetic data generation and by studying the efficacy of synthetic data in predictive models. We evaluate synthetic data on three criteria: fidelity, how well it mirrors real-world data statistically; utility, how effective it is for machine learning applications; and privacy, the risk of re-identification. We also examine the usefulness of synthetic data for hyperparameter optimization of predictive models, particularly where access to real data is constrained. We found a notable correlation between model accuracy on real data and on synthetic data, suggesting that hyperparameters optimized on synthetic data carry over to real data. Our study confirms the feasibility of using synthetic data on external computing resources to optimize models, which helps address healthcare's computing constraints.
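The idea of tuning a predictive model on synthetic data and checking whether the result carries over to real data can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the toy dataset, the per-class Gaussian sampler standing in for a synthetic data generator, the k-nearest-neighbour classifier, and the function names `make_real_data`, `synthesize`, `knn_accuracy`, and `holdout_scores`. It scores the same grid of hyperparameter values once on real data and once on synthetic data, then measures how closely the two sets of scores agree.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_real_data(n=600):
    """Toy stand-in for a real health dataset: two features, binary outcome."""
    X = rng.normal(size=(n, 2))
    logits = 1.5 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 0] * X[:, 1]
    y = (logits + rng.normal(scale=1.0, size=n) > 0).astype(int)
    return X, y

def synthesize(X, y, n):
    """Hypothetical generator: sample from a Gaussian fitted to each class."""
    parts_X, parts_y = [], []
    for label in (0, 1):
        Xc = X[y == label]
        count = int(round(n * len(Xc) / len(X)))  # keep the class balance
        mean, cov = Xc.mean(axis=0), np.cov(Xc, rowvar=False)
        parts_X.append(rng.multivariate_normal(mean, cov, size=count))
        parts_y.append(np.full(count, label))
    return np.vstack(parts_X), np.concatenate(parts_y)

def knn_accuracy(X_train, y_train, X_test, y_test, k):
    """Accuracy of a k-nearest-neighbour majority vote (k is odd, so no ties)."""
    dist = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    pred = (y_train[nearest].mean(axis=1) > 0.5).astype(int)
    return float((pred == y_test).mean())

def holdout_scores(X, y, ks):
    """Score every candidate k on a random 50/50 train/test split."""
    idx = rng.permutation(len(X))
    half = len(X) // 2
    tr, te = idx[:half], idx[half:]
    return np.array([knn_accuracy(X[tr], y[tr], X[te], y[te], k) for k in ks])

ks = [1, 3, 5, 9, 15, 25, 51, 101]            # hyperparameter grid
X_real, y_real = make_real_data()
X_syn, y_syn = synthesize(X_real, y_real, n=len(X_real))

real_scores = holdout_scores(X_real, y_real, ks)
syn_scores = holdout_scores(X_syn, y_syn, ks)

# Agreement between the two tuning curves across the grid.
corr = float(np.corrcoef(real_scores, syn_scores)[0, 1])

# Pick k using synthetic data only, then look up its score on real data.
best_k_syn = ks[int(np.argmax(syn_scores))]
real_acc_at_best = float(real_scores[ks.index(best_k_syn)])

print("correlation of accuracies across k:", corr)
print("k chosen on synthetic data:", best_k_syn)
print("real-data accuracy at that k:", real_acc_at_best)
```

A high correlation in a setup like this would indicate that the ranking of hyperparameter values found on synthetic data resembles the ranking on real data. The sketch covers only a utility-style check; fidelity and privacy would need their own metrics.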

Keywords

synthetic health data, machine learning, healthcare, public health, hyperparameter tuning, data utility, data privacy, data-driven healthcare, predictive healthcare analytics, model optimization, patient privacy, clinical data analysis, medical research data, health data science