exKidneyBERT: A Language Model for Kidney Transplant Pathology Reports and the Crucial Role of Extended Vocabularies

dc.contributor.author: Yang, Tiancheng
dc.date.accessioned: 2022-09-30T20:36:05Z
dc.date.available: 2022-09-30T20:36:05Z
dc.date.issued: 2022-09-30
dc.date.submitted: 2022-06-24
dc.description.abstract: Background: Pathology reports contain key information about the patient's diagnosis as well as important gross and microscopic findings. These information-rich clinical reports offer an invaluable resource for clinical studies, but data extraction and analysis are often manual and tedious given their unstructured text. An automated data extraction method for pathology reports would therefore be of significant value and utility. Language modeling is useful for classifying and extracting information from natural language reports. Released in 2018, Bidirectional Encoder Representations from Transformers (BERT) achieved state-of-the-art performance on several natural language processing (NLP) tasks. Pre-training BERT on a task-specific domain usually improves model performance. BioBERT, pre-trained on large biomedical corpora starting from BERT, outperformed BERT on biomedical NLP tasks. Clinical BERT, pre-trained on clinical data starting from BioBERT, achieved better results than BioBERT on clinical NLP tasks. It is not clear, however, whether pre-training on ever smaller training data sets is worthwhile.
Objective: To develop a language model for renal transplant pathology reports that extracts the answers to two pre-defined questions.
Methods: The study aimed to answer two pre-defined questions: 1) "What kind of rejection does the patient show?" and 2) "What is the grade of interstitial fibrosis and tubular atrophy (IFTA)?" First, we followed the conventionally recommended procedure and further pre-trained Clinical BERT with a corpus of 3.4K renal transplant reports (1.5M words) using masked language modeling to obtain Kidney BERT. Second, we hypothesized that the conventional pre-training procedure fails to capture the intricate vocabulary of narrow technical domains. We created extended Kidney BERT (exKidneyBERT) by adding six words to Clinical BERT's tokenizer vocabulary and then pre-training it on the same corpus as Kidney BERT. Third, all three models were fine-tuned with question-answering (QA) heads for the two questions.
Results: For the first question, regarding rejection, the word-level overlap ratio of exKidneyBERT (83.3% for antibody-mediated rejection (ABMR) and 79.2% for T-cell mediated rejection (TCMR)) exceeds that of both Clinical BERT and Kidney BERT (46.1% for ABMR and 65.2% for TCMR). For the second question, regarding IFTA, the exact match rate of exKidneyBERT (95.8%) exceeds that of Kidney BERT (95.0%) and Clinical BERT (94.7%).
Conclusion: When working in domains with highly specialized vocabulary, it is essential to extend the vocabulary of the BERT tokenizer to improve model performance. In this case, pre-training BERT language models for kidney pathology reports improved model performance even though the training data were relatively small.
dc.identifier.uri: http://hdl.handle.net/10012/18862
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.title: exKidneyBERT: A Language Model for Kidney Transplant Pathology Reports and the Crucial Role of Extended Vocabularies
dc.type: Master Thesis
uws-etd.degree: Master of Mathematics
uws-etd.degree.department: Statistics and Actuarial Science
uws-etd.degree.discipline: Statistics
uws-etd.degree.grantor: University of Waterloo
uws-etd.embargo.terms: 0
uws.contributor.advisor: Schonlau, Matthias
uws.contributor.affiliation1: Faculty of Mathematics
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text
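The central methodological step described in the abstract is adding whole-word tokens to Clinical BERT's WordPiece vocabulary before continuing masked language modeling pre-training on the kidney report corpus. The sketch below illustrates how such a step can be set up; it assumes the Hugging Face transformers and datasets libraries and the Bio_ClinicalBERT checkpoint, none of which are named in this record, and the six terms and corpus file name are hypothetical placeholders rather than the ones used in the thesis. The subsequent QA-head fine-tuning step is omitted.

    # Minimal sketch (not the thesis code) of vocabulary extension followed by
    # continued masked language modeling (MLM) pre-training.
    # Assumptions: Hugging Face transformers/datasets; Bio_ClinicalBERT as Clinical BERT.
    from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)
    from datasets import load_dataset

    BASE_MODEL = "emilyalsentzer/Bio_ClinicalBERT"  # assumed Clinical BERT checkpoint
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL)

    # Hypothetical placeholders for the six domain terms added in exKidneyBERT
    # (the actual terms are not listed in this record).
    new_terms = ["term_1", "term_2", "term_3", "term_4", "term_5", "term_6"]

    # Register the terms as single tokens so they are no longer split into
    # sub-word pieces, then enlarge the embedding matrix; the new rows are
    # randomly initialized and learned during continued pre-training.
    num_added = tokenizer.add_tokens(new_terms)
    model.resize_token_embeddings(len(tokenizer))
    print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}.")

    # "reports.txt" is a hypothetical file standing in for the 3.4K-report corpus.
    dataset = load_dataset("text", data_files={"train": "reports.txt"})["train"]
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True,
        remove_columns=["text"],
    )

    # Standard MLM objective: 15% of tokens are masked at random.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="exKidneyBERT", num_train_epochs=3),
        train_dataset=dataset,
        data_collator=collator,
    )
    trainer.train()

Running the same continued pre-training without the tokenizer-extension lines would correspond to the Kidney BERT baseline described in the abstract, which is the comparison the thesis uses to isolate the effect of the extended vocabulary.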

Files

Original bundle
Name: Yang_Tiancheng.pdf
Size: 1000.82 KB
Format: Adobe Portable Document Format