Evaluating Synthetic Data as a Proxy for Real Clinical Data in Machine Learning Models: A Comparative Study on Postpartum Hemorrhage Prediction

Date

2024-08-30

Advisor

Chen, Helen

Publisher

University of Waterloo

Abstract

Introduction: This thesis investigates the use of synthetic data as a proxy for real clinical data in predictive modeling, focusing on postpartum hemorrhage (PPH). Synthetic data addresses privacy concerns by mimicking real patient data without exposing patient information. The goal is to develop and validate predictive models for PPH on synthetic data and compare them with the same models run on real data, thereby assessing the feasibility and effectiveness of synthetic data in clinical settings.

Methods: Synthetic data was generated using Generative Adversarial Networks (GANs) from MDClone to replicate the statistical properties of real clinical data from Ottawa Hospital. The data underwent a thorough cleaning and preparation process, followed by feature selection. Machine learning and statistical models, including logistic regression, decision trees, random forests, and support vector machines, were developed and trained on the synthetic data, and the same pipeline was then run on the real data at Ottawa Hospital. Model performance was evaluated using precision, recall, F1-score, accuracy, and the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve.

Results: The synthetic data closely mirrored the real data in statistical properties, with low Hellinger distances for most variables. Machine learning models trained on synthetic data demonstrated high performance, with results comparable to those of models trained on real data. Key predictors of PPH were identified, including the administration of certain medications and clinical parameters. The comparative analysis showed minimal discrepancies between model outputs from synthetic and real data, validating the use of synthetic data for predictive modeling.

Discussion: The findings indicate that synthetic data can be used effectively to develop predictive models for PPH while addressing data accessibility. The study highlights the potential of synthetic data to enhance predictive modeling in healthcare, providing a viable alternative to real data without compromising accuracy. The integration of synthetic data in clinical research can facilitate broader data availability, fostering innovation while adhering to privacy regulations.

Conclusion: This research demonstrates the viability of synthetic data in predictive modeling for PPH, with models trained on synthetic data showing high performance comparable to that of models trained on real data. The study contributes to the theoretical understanding of synthetic data utility and offers practical implications for improving patient outcomes and optimizing healthcare resources. Future research should focus on expanding the use of synthetic data to other clinical areas and further validating its effectiveness in diverse healthcare settings.
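
The Results report low Hellinger distances between synthetic and real variables. As a rough illustration of that fidelity measure, the sketch below computes the Hellinger distance between the empirical distributions of one categorical (or pre-binned) variable in two samples. It is a minimal sketch under stated assumptions, not the thesis's implementation: the function name `hellinger_distance` and the toy values are hypothetical, and the abstract does not say how continuous variables were binned.

```python
import math
from collections import Counter


def hellinger_distance(real_values, synthetic_values):
    """Hellinger distance between the empirical distributions of two samples.

    Assumes a categorical or already-binned variable (an assumption of this
    sketch). Returns a value in [0, 1]: 0 means identical distributions,
    1 means the two samples share no category.
    """
    if not real_values or not synthetic_values:
        raise ValueError("both samples must be non-empty")

    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    n_real = len(real_values)
    n_synth = len(synthetic_values)

    # Sum (sqrt(p) - sqrt(q))^2 over every category seen in either sample.
    squared_sum = 0.0
    for category in set(real_counts) | set(synth_counts):
        p = real_counts[category] / n_real
        q = synth_counts[category] / n_synth
        squared_sum += (math.sqrt(p) - math.sqrt(q)) ** 2

    return math.sqrt(squared_sum / 2.0)


# Toy, made-up values for a hypothetical binary variable (not thesis data).
real_sample = ["yes", "no", "no", "no", "yes", "no"]
synthetic_sample = ["yes", "no", "no", "yes", "yes", "no"]
distance = hellinger_distance(real_sample, synthetic_sample)
print(f"Hellinger distance: {distance:.3f}")
```

In a fidelity check of this kind, the distance would be computed per variable, and variables with values near 0 would be read as well reproduced by the synthetic data.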

Keywords

Synthetic Data, GANs, Artificial Intelligence, AI, Machine Learning, Postpartum Haemorrhage, PPH, Synthetic Health Data, Synthetic Data Utility Assessment, Synthetic Data Fidelity Assessment, Utility, Fidelity
