Using Natural Language Processing to Detect Breast Cancer Recurrence in Clinical Notes: A Hierarchical Machine Learning Approach

Subendran, Sujan

Files

Subendran_Sujan.pdf (787.24 KB)

Date

2021-04-26

Authors

Subendran, Sujan

Advisor

Chen, Helen

Publisher

University of Waterloo

Abstract

The vast amount of data amassed in electronic health records (EHRs) creates both a need and an opportunity for automated information extraction using machine learning techniques. Natural language processing (NLP) has the potential to substantially reduce the burden of manual chart review for extracting risk factors, adverse events, or outcomes that are documented in unstructured clinical reports and progress notes. In this thesis, an NLP pipeline was built using open-source software to process a corpus of electronic clinical notes extracted from an integrated health care system at Cancer Care Manitoba (CCMB), covering a cohort of women with early-stage incident breast cancer. The goal was to identify whether and when recurrences were diagnosed. We developed and evaluated the system using 117,365 clinical notes from 892 patients receiving EHR-documented care at CCMB between 2004 and 2007. We used a hierarchical architecture: a model first provides the patient-level recurrence status, and the NLP pipeline then detects the notes that contain information about the recurrence and its date. Class imbalance was a significant issue, as the ratio of positive to negative notes was approximately 1:22. Various techniques, including undersampling and cost-based classification, were used to mitigate this issue. The XGBoost classifier was the best-performing model, achieving a balanced accuracy of 0.924, sensitivity of 0.867, specificity of 0.981, precision of 0.886, and area under the ROC curve (AUC) of 0.924. In addition, further data were collected from a similar cohort for the years 2008 to 2012. This dataset, which includes 615 patients with 78,460 notes, was used to validate the performance of the models. The model performed well, with a balanced accuracy of 0.909, sensitivity of 0.843, specificity of 0.974, precision of 0.575, and AUC of 0.909.
The study demonstrates that natural language processing and machine learning techniques can assist chart review by 1) excluding a large number of notes that contain no relevant information and 2) identifying the notes most likely to contain relevant recurrence information, so that the timing of recurrence can be identified accurately.
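
As a reading aid for the metrics reported above, the following minimal sketch recomputes balanced accuracy from the stated sensitivity and specificity. It is not code from the thesis; the function name `balanced_accuracy` is chosen here for illustration, and the only inputs are the rounded figures quoted in the abstract.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of the true-positive rate and the true-negative rate."""
    return (sensitivity + specificity) / 2


# Rounded figures quoted in the abstract (not recomputed from the data).
development = balanced_accuracy(sensitivity=0.867, specificity=0.981)
validation = balanced_accuracy(sensitivity=0.843, specificity=0.974)

print(f"2004-2007 cohort: {development:.4f}")
print(f"2008-2012 cohort: {validation:.4f}")
```

The first value matches the reported 0.924. The second comes out at about 0.9085 from the rounded inputs, which is consistent with the reported 0.909 if the thesis computed it from unrounded values. For a classifier evaluated at a single decision threshold, the area under the resulting two-segment ROC curve equals this same quantity, which may be why the reported AUC and balanced accuracy coincide in both cohorts.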