
Using Natural Language Processing to Detect Breast Cancer Recurrence in Clinical Notes: A Hierarchical Machine Learning Approach

dc.contributor.author: Subendran, Sujan
dc.date.accessioned: 2021-04-26T13:39:44Z
dc.date.available: 2021-04-26T13:39:44Z
dc.date.issued: 2021-04-26
dc.date.submitted: 2021-04-15
dc.description.abstract: The vast amount of data amassed in electronic health records (EHRs) creates both a need and an opportunity for automated extraction of information from EHRs using machine learning techniques. Natural language processing (NLP) has the potential to substantially reduce the burden of manual chart review to extract risk factors, adverse events, or outcomes that are documented in unstructured clinical reports and progress notes. In this thesis, an NLP pipeline was built using open-source software to process a corpus of electronic clinical notes extracted from an integrated health care system in Cancer Care Manitoba (CCMB), which contains a cohort of women with early-stage incident breast cancers. The goal is to identify whether and when recurrences were diagnosed. We developed and evaluated the system using 117,365 clinical notes from 892 patients receiving EHR-documented care at CCMB between 2004 and 2007. We used a hierarchical architecture, in which a model is first built to provide the patient-level recurrence status, and the NLP pipeline is then used to detect notes that contain information about recurrence and the date of recurrence. Class imbalance was a significant issue, as the ratio of positive to negative notes was approximately 1:22. Various techniques, including undersampling and cost-based classification, were used to mitigate this issue. The XGBoost classifier was the best-performing model, achieving a balanced accuracy of 0.924, with sensitivity of 0.867, specificity of 0.981, precision of 0.886, and area under the ROC curve (AUC) of 0.924. In addition, more data were collected from the years 2008 to 2012 in a similar cohort. This dataset, which includes 615 patients with 78,460 notes, was used to validate the performance of the models. The model performed well, with a balanced accuracy of 0.909, sensitivity of 0.843, specificity of 0.974, precision of 0.575, and AUC of 0.909.
The study has demonstrated the ability to use natural language processing and machine learning techniques to assist in chart review by (1) excluding a large number of notes that contain no relevant information and (2) identifying the notes most likely to contain relevant recurrence information, so that the timing of recurrence can be identified accurately. [en]
dc.identifier.uri: http://hdl.handle.net/10012/16903
dc.language.iso: en [en]
dc.pending: false
dc.publisher: University of Waterloo [en]
dc.subject: machine learning [en]
dc.subject: natural language processing [en]
dc.subject: breast cancer recurrence [en]
dc.subject: medical chart abstraction [en]
dc.subject: imbalanced classification [en]
dc.title: Using Natural Language Processing to Detect Breast Cancer Recurrence in Clinical Notes: A Hierarchical Machine Learning Approach [en]
dc.type: Master Thesis [en]
uws-etd.degree: Master of Applied Science [en]
uws-etd.degree.department: Systems Design Engineering [en]
uws-etd.degree.discipline: System Design Engineering [en]
uws-etd.degree.grantor: University of Waterloo [en]
uws-etd.embargo.terms: 0 [en]
uws.contributor.advisor: Chen, Helen
uws.contributor.affiliation1: Faculty of Engineering [en]
uws.peerReviewStatus: Unreviewed [en]
uws.published.city: Waterloo [en]
uws.published.country: Canada [en]
uws.published.province: Ontario [en]
uws.scholarLevel: Graduate [en]
uws.typeOfResource: Text [en]
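
The evaluation metrics quoted in the abstract can be related to one another with a short sketch. This is an illustration only, not code from the thesis: the function names and the confusion-matrix counts below are hypothetical, and the only figures taken from the record are the reported sensitivity, specificity, and balanced accuracy values. It uses the standard definition of balanced accuracy as the mean of sensitivity and specificity.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Standard definition: mean of recall on positives and recall on negatives."""
    return (sensitivity + specificity) / 2


def metrics_from_counts(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive note-level metrics from confusion-matrix counts (hypothetical helper)."""
    sensitivity = tp / (tp + fn)   # share of recurrence notes that were flagged
    specificity = tn / (tn + fp)   # share of non-recurrence notes correctly excluded
    precision = tp / (tp + fp)     # share of flagged notes that really are positive
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "balanced_accuracy": balanced_accuracy(sensitivity, specificity),
    }


# Figures reported in the abstract (development and validation cohorts).
dev = balanced_accuracy(0.867, 0.981)
val = balanced_accuracy(0.843, 0.974)
print(f"development cohort: {dev:.4f}")
print(f"validation cohort:  {val:.4f}")

# Made-up counts, purely to show how the helper is called.
example = metrics_from_counts(tp=8, fp=1, tn=89, fn=2)
print(example)
```

Under this definition, the reported sensitivity and specificity pairs reproduce the reported balanced accuracies of 0.924 and 0.909 to within rounding. Balanced accuracy is a common choice when classes are imbalanced, as with the roughly 1:22 ratio described in the abstract, because plain accuracy can be high for a model that simply predicts the majority class.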

Files

Original bundle
Name: Subendran_Sujan.pdf
Size: 787.24 KB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission