dc.contributor.author | Rizvi, Saira | |
dc.date.accessioned | 2021-10-01 15:24:37 (GMT) | |
dc.date.available | 2021-10-01 15:24:37 (GMT) | |
dc.date.issued | 2021-10-01 | |
dc.date.submitted | 2021-09-15 | |
dc.identifier.uri | http://hdl.handle.net/10012/17609 | |
dc.description.abstract | This work introduces the task of misinformation retrieval, that is, identifying all documents containing misinformation for a given topic, and proposes a pipeline for misinformation retrieval on tweets. As part of the work, I curated 50 COVID-19 misinformation topics used in the TREC 2020 Health Misinformation Track. In addition, I annotated a test set of tweets using the TREC COVID-19 misinformation topics. Misinformation on social media has proven highly detrimental to communities by encouraging harmful and often life-threatening behavior. The chaos caused by COVID-19 misinformation has created an urgent need for misinformation detection methods to moderate social media platforms. Drawing upon previous work in misinformation detection and the TREC 2020 Health Misinformation Track, I focused on the task of misinformation retrieval on social media. I extended the COVID-Lies data set, created to detect COVID-19 misinformation in tweets, by rephrasing the misconceptions accompanying each tweet. I also created 50 COVID-19 related topics for the TREC 2020 Health Misinformation Track, used for evaluation purposes. I propose a natural language inference (NLI) based approach using CT-BERT to identify tweets that contradict a given fact; the model’s classification probability is used to score documents. The model was trained using combinations of NLI data sets to find the best approach. Tweets were labeled for the TREC 2020 Health Misinformation Track topics to create a test set, on which the best model achieves an AUC of 0.81. I conducted several experiments, which show that domain adaptation significantly improved the ability to detect misinformation. A combination of a large NLI corpus, such as SNLI, and an in-domain data set, such as COVID-Lies, achieves the best performance on the test set. The pipeline retrieved and ranked tweets based on misinformation for 7 TREC topics from the COVID-19 Twitter stream. The top 20 unique tweets were analyzed using Precision@20 to evaluate the pipeline. | en |
dc.language.iso | en | en |
dc.publisher | University of Waterloo | en |
dc.subject | Data Science | en |
dc.subject | Information Retrieval | en |
dc.subject | Natural Language Processing | en |
dc.subject | Misinformation | en |
dc.subject.lcsh | Information retrieval | en |
dc.subject.lcsh | Natural language processing (Computer science) | en |
dc.subject.lcsh | Misinformation | en |
dc.title | Misinformation Retrieval | en |
dc.type | Master Thesis | en |
dc.pending | false | |
uws-etd.degree.department | David R. Cheriton School of Computer Science | en |
uws-etd.degree.discipline | Computer Science | en |
uws-etd.degree.grantor | University of Waterloo | en |
uws-etd.degree | Master of Mathematics | en |
uws-etd.embargo.terms | 0 | en |
uws.contributor.advisor | Clarke, Charles L. A., 1964- | |
uws.contributor.affiliation1 | Faculty of Mathematics | en |
uws.published.city | Waterloo | en |
uws.published.country | Canada | en |
uws.published.province | Ontario | en |
uws.typeOfResource | Text | en |
uws.peerReviewStatus | Unreviewed | en |
uws.scholarLevel | Graduate | en |