Using Machine Learning Algorithms for Finding the Topics of COVID-19 Open Research Dataset Automatically

dc.contributor.author: Hamzeian, Donya
dc.date.accessioned: 2021-02-26T19:23:58Z
dc.date.available: 2021-02-26T19:23:58Z
dc.date.issued: 2021-02-26
dc.date.submitted: 2021-02-23
dc.description.abstract: The COVID-19 Open Research Dataset (CORD-19) is a collection of over 400,000 scholarly papers (as of January 11, 2021) about COVID-19, SARS-CoV-2, and related coronaviruses, curated by the Allen Institute for AI. Carrying out an exploratory literature review of these papers has become a time-sensitive and exhausting challenge during the pandemic. The topic modeling pipeline presented in this thesis helps researchers gain an overview of the topics addressed in the papers. The preprocessing framework identifies Unified Medical Language System (UMLS) entities by using MedLinker, which handles Word Sense Disambiguation (WSD) through a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model. The topic model used in this research is a Variational Autoencoder implementing ProdLDA, which is an extension of the Latent Dirichlet Allocation (LDA) model. Applying the pipeline to the CORD-19 dataset achieved a topic coherence value of 0.7 and topic diversity measures of almost 100%. [en]
dc.identifier.uri: http://hdl.handle.net/10012/16834
dc.language.iso: en [en]
dc.pending: false
dc.publisher: University of Waterloo [en]
dc.relation.uri: https://github.com/DonyaHamzeian/BiomedicalTopicModelling [en]
dc.relation.uri: https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge [en]
dc.subject: machine learning [en]
dc.subject: topic modelling [en]
dc.subject: prodLDA [en]
dc.subject: Latent Dirichlet Allocation [en]
dc.subject: BERT [en]
dc.subject: MedLinker [en]
dc.subject: CORD-19 [en]
dc.subject: automatic exploratory literature review [en]
dc.subject: scoping review [en]
dc.subject.lcsh: COVID-19 Pandemic, 2020- , in mass media [en]
dc.subject.lcsh: Machine learning [en]
dc.title: Using Machine Learning Algorithms for Finding the Topics of COVID-19 Open Research Dataset Automatically [en]
dc.type: Master Thesis [en]
uws-etd.degree: Master of Mathematics [en]
uws-etd.degree.department: Statistics and Actuarial Science [en]
uws-etd.degree.discipline: Statistics [en]
uws-etd.degree.grantor: University of Waterloo [en]
uws-etd.embargo.terms: 0 [en]
uws.comment.hidden: The GitHub repository does not contain the code yet. However, I will submit the code by February 24th. I would appreciate it if you could review the rest of the documents. [en]
uws.contributor.advisor: Ghodsi, Ali
uws.contributor.advisor: Chen, Helen (Assistant Professor)
uws.contributor.affiliation1: Faculty of Mathematics [en]
uws.peerReviewStatus: Unreviewed [en]
uws.published.city: Waterloo [en]
uws.published.country: Canada [en]
uws.published.province: Ontario [en]
uws.scholarLevel: Graduate [en]
uws.typeOfResource: Text [en]
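
The abstract reports topic diversity of almost 100%, but this record does not define the measure. The following is a minimal sketch, assuming one common definition: the share of unique words among the top-k words of all topics. The function name `topic_diversity`, the `top_k` parameter, and the toy topics are hypothetical illustrations and are not taken from the thesis or its repository.

```python
from typing import Sequence


def topic_diversity(topics: Sequence[Sequence[str]], top_k: int = 25) -> float:
    """Return the fraction of unique words among the top_k words of every topic.

    Assumption: each topic is a list of words already sorted by descending
    probability. A value of 1.0 means no word is shared between topics.
    """
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    # Collect the top_k words of each topic into one flat list.
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("topics must contain at least one word")
    return len(set(top_words)) / len(top_words)


# Hypothetical toy topics; "virus" appears in two of them.
toy_topics = [
    ["virus", "infection", "protein", "cell"],
    ["vaccine", "trial", "dose", "immune"],
    ["hospital", "patient", "care", "virus"],
]
score = topic_diversity(toy_topics, top_k=4)
print(f"topic diversity: {score:.3f}")
```

Under this definition, a score near 1.0 indicates that the topics rarely share their highest-ranked words, which is one way to read the "almost 100%" figure quoted in the abstract.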

Files

Original bundle

Name: Hamzeian_Donya.pdf
Size: 1.96 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission