exKidneyBERT: A Language Model for Kidney Transplant Pathology Reports and the Crucial Role of Extended Vocabularies

dc.contributor.author: Yang, Tiancheng
dc.date.accessioned: 2022-09-30T20:36:05Z
dc.date.available: 2022-09-30T20:36:05Z
dc.date.issued: 2022-09-30
dc.date.submitted: 2022-06-24
dc.description.abstract: Background: Pathology reports contain key information about the patient's diagnosis as well as important gross and microscopic findings. These information-rich clinical reports offer an invaluable resource for clinical studies, but data extraction and analysis are often manual and tedious given their unstructured text. An automated data extraction method for pathology reports would therefore be of significant value and utility. Language modeling is useful for classifying and extracting information from natural language reports. Released in 2018, Bidirectional Encoder Representations from Transformers (BERT) achieved state-of-the-art performance on several natural language processing (NLP) tasks. Pre-training BERT on a task-specific domain usually improves model performance. BioBERT, pre-trained on large biomedical corpora starting from BERT, outperformed BERT on biomedical NLP tasks. Clinical BERT, pre-trained on clinical data starting from BioBERT, achieved better results than BioBERT on clinical NLP tasks. It is not clear, however, whether pre-training on ever smaller training data sets is worthwhile.
Objective: To develop a language model for renal transplant pathology reports that extracts the answers to two pre-defined questions.
Methods: The study aimed to answer two pre-defined questions: 1) "What kind of rejection does the patient show?" and 2) "What is the grade of interstitial fibrosis and tubular atrophy (IFTA)?" First, we followed the conventionally recommended procedure and further pre-trained Clinical BERT with a corpus of 3.4K renal transplant reports (1.5M words) using masked language modeling to obtain Kidney BERT. Second, we hypothesized that the conventional pre-training procedure fails to capture the intricate vocabulary of narrow technical domains. We created extended Kidney BERT (exKidneyBERT) by adding six words to Clinical BERT's tokenizer vocabulary and then pre-training it on the same corpus as Kidney BERT. Third, all three models were fine-tuned with question-answering (QA) heads for the two questions.
Results: For the first question, regarding rejection, the word-level overlap ratio of exKidneyBERT (83.3% for antibody-mediated rejection (ABMR) and 79.2% for T-cell mediated rejection (TCMR)) exceeds that of both Clinical BERT and Kidney BERT (46.1% for ABMR and 65.2% for TCMR). For the second question, regarding IFTA, the exact match rate of exKidneyBERT (95.8%) exceeds that of Kidney BERT (95.0%) and Clinical BERT (94.7%).
Conclusion: When working in domains with highly specialized vocabulary, it is essential to extend the vocabulary of the BERT tokenizer to improve model performance. In this case, pre-training BERT language models for kidney pathology reports improved model performance even though the training data were relatively small.
dc.identifier.uri: http://hdl.handle.net/10012/18862
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.title: exKidneyBERT: A Language Model for Kidney Transplant Pathology Reports and the Crucial Role of Extended Vocabularies
dc.type: Master Thesis
uws-etd.degree: Master of Mathematics
uws-etd.degree.department: Statistics and Actuarial Science
uws-etd.degree.discipline: Statistics
uws-etd.degree.grantor: University of Waterloo
uws-etd.embargo.terms: 0
uws.contributor.advisor: Schonlau, Matthias
uws.contributor.affiliation1: Faculty of Mathematics
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text
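The central methodological step described in the abstract is adding whole-word tokens to Clinical BERT's WordPiece vocabulary before continuing masked language modeling pre-training on the kidney report corpus. The sketch below illustrates how such a step can be set up; it assumes the Hugging Face transformers and datasets libraries and the Bio_ClinicalBERT checkpoint, none of which are named in this record, and the six terms and corpus file name are hypothetical placeholders rather than the ones used in the thesis. The subsequent QA-head fine-tuning step is omitted.

    # Minimal sketch (not the thesis code) of vocabulary extension followed by
    # continued masked language modeling (MLM) pre-training.
    # Assumptions: Hugging Face transformers/datasets; Bio_ClinicalBERT as Clinical BERT.
    from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)
    from datasets import load_dataset

    BASE_MODEL = "emilyalsentzer/Bio_ClinicalBERT"  # assumed Clinical BERT checkpoint
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL)

    # Hypothetical placeholders for the six domain terms added in exKidneyBERT
    # (the actual terms are not listed in this record).
    new_terms = ["term_1", "term_2", "term_3", "term_4", "term_5", "term_6"]

    # Register the terms as single tokens so they are no longer split into
    # sub-word pieces, then enlarge the embedding matrix; the new rows are
    # randomly initialized and learned during continued pre-training.
    num_added = tokenizer.add_tokens(new_terms)
    model.resize_token_embeddings(len(tokenizer))
    print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}.")

    # "reports.txt" is a hypothetical file standing in for the 3.4K-report corpus.
    dataset = load_dataset("text", data_files={"train": "reports.txt"})["train"]
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True,
        remove_columns=["text"],
    )

    # Standard MLM objective: 15% of tokens are masked at random.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="exKidneyBERT", num_train_epochs=3),
        train_dataset=dataset,
        data_collator=collator,
    )
    trainer.train()

Running the same continued pre-training without the tokenizer-extension lines would correspond to the Kidney BERT baseline described in the abstract, which is the comparison the thesis uses to isolate the effect of the extended vocabulary.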

Files

Original bundle
Name: Yang_Tiancheng.pdf
Size: 1000.82 KB
Format: Adobe Portable Document Format