Using Machine Learning Algorithms for Finding the Topics of COVID-19 Open Research Dataset Automatically
Date
2021-02-26
Authors
Hamzeian, Donya
Advisor
Ghodsi, Ali
Chen, Helen (Assistant Professor)
Publisher
University of Waterloo
Abstract
The COVID-19 Open Research Dataset (CORD-19) is a collection of over 400,000 scholarly papers (as of January 11, 2021) about COVID-19, SARS-CoV-2, and related coronaviruses, curated by the Allen Institute for AI. Carrying out an exploratory literature review of these papers has become a time-sensitive and exhausting challenge during the pandemic. The topic modeling pipeline presented in this thesis helps researchers gain an overview of the topics addressed in the papers. The preprocessing framework identifies Unified Medical Language System (UMLS) entities by using MedLinker, which handles Word Sense Disambiguation (WSD) through a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model. The topic model used in this research is a Variational Autoencoder implementing ProdLDA, which is an extension of the Latent Dirichlet Allocation (LDA) model. Applying the pipeline to the CORD-19 dataset achieved a topic coherence value of 0.7 and topic diversity measures of almost 100%.
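As an illustration of the topic diversity measure reported above, the sketch below uses one common definition: the fraction of unique words among the top-k words of all topics. This record does not state which definition the thesis uses, so the definition, the function name `topic_diversity`, the `top_k` parameter, and the example topics are all assumptions made for illustration.

```python
from typing import List


def topic_diversity(topics: List[List[str]], top_k: int = 25) -> float:
    """Return the fraction of unique words among the top_k words of every topic.

    Assumed definition, for illustration only: a value near 1.0 means the
    topics share almost no top words; a value near 0.0 means heavy overlap.
    """
    # Collect the first top_k words of each topic into one flat list.
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Hypothetical topics, not results from the thesis.
example_topics = [
    ["vaccine", "antibody", "immune", "dose"],
    ["ventilator", "icu", "oxygen", "dose"],
]
diversity = topic_diversity(example_topics, top_k=4)
```

Under this definition, a diversity of almost 100% would mean that nearly every top word belongs to exactly one topic.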
Keywords
machine learning, topic modelling, prodLDA, Latent Dirichlet Allocation, BERT, MedLinker, CORD-19, automatic exploratory literature review, scoping review
LC Subject Headings
COVID-19 Pandemic, 2020- , in mass media, Machine learning