Murugan, Anand
2024-05-28
2024-05-28
2024-05-28
2024-04-30
http://hdl.handle.net/10012/20624

Healthcare Machine Learning (HML) models are revolutionizing the healthcare industry, promising improved patient outcomes and enhanced public health. However, it is essential to ensure fairness, i.e., that models deliver equitable performance to all individuals, irrespective of their inherent or acquired characteristics. This requires a thorough examination of the data used and the specific applications of these models. This study conducted a six-year systematic survey of models trained on the Medical Information Mart for Intensive Care (MIMIC) clinical research database (CRD), one of the most widely used HML databases, to explore the link between data and fairness in HML. The results were striking: for the popular MIMIC IV – ICU mortality task, a naive baseline outperformed the state-of-the-art (SOTA) model in prediction performance while also demonstrating greater fairness across subgroups (though it remained somewhat unfair). These findings demonstrate the urgent need to integrate fairness into healthcare machine learning models and to involve practitioners more closely in HML modeling. To achieve this, we propose a data-centric approach to fairness through our ‘Datasheet for MIMIC IV v2.0 CRD’, modeled after recent work recommending datasheets for datasets. Given that MIMIC is large and complex, this datasheet will assist practitioners in identifying data anomalies and task-specific feature-target relationships during modeling, thereby fostering the development of equitable HML models.

en
Fairness; healthcare machine learning; clinical research database; medical information mart for intensive care (MIMIC); risk prediction; Datasheet for MIMIC IV v2.0 CRD
Implementing Fairness in Real-World Healthcare Machine Learning through Datasheet for Database
Master Thesis