dc.contributor.author: Subendran, Sujan
dc.date.accessioned: 2021-04-26 13:39:44 (GMT)
dc.date.available: 2021-04-26 13:39:44 (GMT)
dc.date.issued: 2021-04-26
dc.date.submitted: 2021-04-15
dc.identifier.uri: http://hdl.handle.net/10012/16903
dc.description.abstract: The vast amount of data amassed in electronic health records (EHRs) creates both a need and an opportunity for automated information extraction using machine learning techniques. Natural language processing (NLP) has the potential to substantially reduce the burden of manual chart review for extracting risk factors, adverse events, or outcomes that are documented in unstructured clinical reports and progress notes. In this thesis, an NLP pipeline was built using open-source software to process a corpus of electronic clinical notes extracted from an integrated health care system at Cancer Care Manitoba (CCMB) for a cohort of women with early-stage incident breast cancer. The goal is to identify whether and when recurrences were diagnosed. We developed and evaluated the system using 117,365 clinical notes from 892 patients receiving EHR-documented care at CCMB between 2004 and 2007. We used a hierarchical architecture: a model first provides the patient-level recurrence status, and the NLP pipeline then detects notes that contain information about recurrence and the date of recurrence. Class imbalance was a significant issue, as the ratio of positive to negative notes was approximately 1:22. Techniques including undersampling and cost-based classification were used to mitigate it. The XGBoost classifier was the best-performing model, achieving a balanced accuracy of 0.924, with sensitivity of 0.867, specificity of 0.981, precision of 0.886 and area under the ROC curve (AUC) of 0.924. In addition, data were collected from a similar cohort for the years 2008 to 2012. This dataset, comprising 615 patients with 78,460 notes, was used to validate the performance of the models. The model performed well, with a balanced accuracy of 0.909, sensitivity of 0.843, specificity of 0.974, precision of 0.575 and AUC of 0.909.
The study demonstrates that NLP and machine learning techniques can assist chart review by 1) excluding a large number of notes that contain no relevant information and 2) identifying the notes most likely to contain relevant recurrence information, so that the timing of recurrence can be identified accurately. [en]
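The balanced accuracy figures in the abstract are consistent with the usual definition of the metric, the mean of sensitivity and specificity. The sketch below assumes that definition (the abstract does not state the formula) and uses an illustrative function name; it only re-derives the reported values from the reported sensitivity and specificity.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of the true-positive rate and the true-negative rate.

    Assumed definition; the abstract reports the metric without a formula.
    """
    return (sensitivity + specificity) / 2


# Sensitivity and specificity as reported in the abstract.
development = balanced_accuracy(0.867, 0.981)  # 2004-2007 cohort
validation = balanced_accuracy(0.843, 0.974)   # 2008-2012 cohort

print(f"development: {development:.4f}")
print(f"validation:  {validation:.4f}")
```

The development value comes out at 0.924, matching the abstract. The validation value comes out at 0.9085, which agrees with the reported 0.909 given that the inputs are themselves rounded to three decimals.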
dc.language.iso: en [en]
dc.publisher: University of Waterloo [en]
dc.subject: machine learning [en]
dc.subject: natural language processing [en]
dc.subject: breast cancer recurrence [en]
dc.subject: medical chart abstraction [en]
dc.subject: imbalanced classification [en]
dc.title: Using Natural Language Processing to Detect Breast Cancer Recurrence in Clinical Notes: A Hierarchical Machine Learning Approach [en]
dc.type: Master Thesis [en]
dc.pending: false
uws-etd.degree.department: Systems Design Engineering [en]
uws-etd.degree.discipline: System Design Engineering [en]
uws-etd.degree.grantor: University of Waterloo [en]
uws-etd.degree: Master of Applied Science [en]
uws-etd.embargo.terms: 0 [en]
uws.contributor.advisor: Chen, Helen
uws.contributor.affiliation1: Faculty of Engineering [en]
uws.published.city: Waterloo [en]
uws.published.country: Canada [en]
uws.published.province: Ontario [en]
uws.typeOfResource: Text [en]
uws.peerReviewStatus: Unreviewed [en]
uws.scholarLevel: Graduate [en]



UWSpace

All items in UWSpace are protected by copyright, with all rights reserved.
