Reducing Health Misinformation in Search Results
Date
2022-08-22
Authors
Zhang, Dake
Advisor
Smucker, Mark
Publisher
University of Waterloo
Abstract
People commonly search the web for answers to health-related questions. As new health information is added to the Internet every day, misinformation proliferates and spreads widely. Previous work has shown that when search results contain health misinformation, people can make incorrect decisions that may negatively affect their lives. To reduce health misinformation in search results, we need to find web documents that contain correct information and rank them above documents that contain misinformation. In this thesis, we describe our efforts to reduce health misinformation in search results.
First, we describe our participation in the TREC 2021 Health Misinformation Track, which provides a framework for evaluating ranking approaches to reducing health misinformation in search results. The track uses Compatibility Difference as its primary evaluation metric, which measures an approach's ability to rank correct and credible documents ahead of incorrect and non-credible documents. In the 2021 track, runs that used the provided correct answers were classified as manual runs. By using the known answers and applying a stance detection model for reranking, our manual method achieved a Compatibility Difference score of 0.176, a substantial improvement over the BM25 baseline score of -0.022.
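As a rough illustration of the metric described above, the sketch below computes a compatibility score as a truncated rank-biased overlap between a run and an ideal ranking, then subtracts the score against a harmful ideal ranking from the score against a helpful one. It is a simplified sketch, not the official track evaluation code: the function names, the persistence value `p=0.95`, and the cutoff `depth=100` are assumptions made for this example.

```python
def rbo(run, ideal, p=0.95, depth=100):
    """Truncated rank-biased overlap between two rankings (lists of doc ids)."""
    score = 0.0
    for i in range(1, depth + 1):
        # Number of documents shared by the top-i prefixes of both rankings.
        overlap = len(set(run[:i]) & set(ideal[:i]))
        score += p ** (i - 1) * overlap / i
    return (1 - p) * score


def compatibility(run, ideal, p=0.95, depth=100):
    """Overlap with the ideal ranking, normalized so the ideal itself scores 1."""
    best = rbo(ideal, ideal, p, depth)
    return rbo(run, ideal, p, depth) / best if best > 0 else 0.0


def compatibility_difference(run, helpful_ideal, harmful_ideal, p=0.95, depth=100):
    """Positive when the run resembles the helpful ideal more than the harmful one."""
    return (compatibility(run, helpful_ideal, p, depth)
            - compatibility(run, harmful_ideal, p, depth))


# Hypothetical document ids, for illustration only.
helpful = ["good1", "good2"]
harmful = ["bad1", "bad2"]
good_first = compatibility_difference(helpful + harmful, helpful, harmful)
bad_first = compatibility_difference(harmful + helpful, helpful, harmful)
```

Under these assumptions, a run that places the helpful documents first receives a positive score and a run that places the harmful documents first receives a negative one, which matches the sign convention of the scores reported in this abstract.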
Second, as an extension of this work, we present a pipeline that automatically derives correct answers by learning which web sources are trustworthy and then uses those answers to reduce health misinformation in search results. Determining the correct answer has been a difficult hurdle for participants in the TREC Health Misinformation Track. In the 2021 track, automatic runs were not allowed to use the known answer to a topic’s health question. By exploiting an existing set of health questions and their known answers, we show that it is possible to learn which web hosts are trustworthy and, from these hosts, to predict the correct answers to the 2021 health questions with an accuracy of 76%. Using our predicted answers, we promote documents that we predict to contain the correct answer and achieve a Compatibility Difference score of 0.129, three times the score of the previous best automatic method (0.043).
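The idea of learning host trustworthiness from questions with known answers, and then using it to predict answers to new questions, can be sketched as follows. This is an illustrative simplification and not the model used in the thesis: it estimates trust as a smoothed agreement rate and predicts with a trust-weighted vote. The function names, the `"helpful"`/`"unhelpful"` labels, the host names, and the smoothing constant are all assumptions made for this example.

```python
from collections import defaultdict


def learn_host_trust(observations, known_answers, smoothing=1.0):
    """Estimate per-host trust as a smoothed rate of agreement with known answers.

    observations: iterable of (question_id, host, stance) tuples, where stance
    is the answer the host's page supports ("helpful" or "unhelpful").
    known_answers: dict mapping question_id to the known correct answer.
    """
    agree = defaultdict(float)
    total = defaultdict(float)
    for qid, host, stance in observations:
        if qid not in known_answers:
            continue  # Skip questions without a known answer.
        total[host] += 1
        if stance == known_answers[qid]:
            agree[host] += 1
    # Laplace-style smoothing pulls rarely seen hosts toward 0.5.
    return {h: (agree[h] + smoothing) / (total[h] + 2 * smoothing) for h in total}


def predict_answer(host_stances, trust, default_trust=0.5):
    """Predict an answer with a vote weighted by (trust - 0.5) per host."""
    score = 0.0
    for host, stance in host_stances:
        weight = trust.get(host, default_trust) - 0.5
        score += weight if stance == "helpful" else -weight
    return "helpful" if score >= 0 else "unhelpful"


# Hypothetical training data: one host always agrees with the known answers,
# the other always disagrees.
known = {"q1": "helpful", "q2": "unhelpful"}
observed = [
    ("q1", "reliable.example", "helpful"),
    ("q2", "reliable.example", "unhelpful"),
    ("q1", "dubious.example", "unhelpful"),
    ("q2", "dubious.example", "helpful"),
]
trust = learn_host_trust(observed, known)
prediction = predict_answer(
    [("reliable.example", "unhelpful"), ("dubious.example", "helpful")], trust
)
```

In this sketch a host with trust below 0.5 counts as evidence against the stance it supports, so both hosts in the example push the prediction toward `"unhelpful"`. Unseen hosts receive the neutral default and do not affect the vote.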
In summary, when evaluated on the TREC 2021 Health Misinformation Track, our final pipeline achieves state-of-the-art performance among automatic runs.