A Preference Judgment Interface for Authoritative Assessment

Loading...
Thumbnail Image

Date

2023-02-06

Authors

Seifikar, Mahsa

Advisor

Clarke, Charles

Journal Title

Journal ISSN

Volume Title

Publisher

University of Waterloo

Abstract

For offline evaluation of information retrieval systems, preference judgments have been demonstrated to be a superior alternative to graded or binary relevance judgments. In contrast to graded judgments, where each document is assigned to a pre-defined grade level, with preference judgments, assessors judge a pair of items presented side by side, indicating which is better. Unfortunately, preference judgments may require a larger number of judgments, even under an assumption of transitivity. Until recently they also lacked well-established evaluation measures. Previous studies have explored various evaluation measures and proposed different approaches to address the perceived shortcomings of preference judgments. These studies focused on crowdsourced preference judgments, where assessors may lack the training and time to make careful judgments. They did not consider the case where assessors have been trained and provided with the time to carefully consider differences between items. For offline evaluation of information retrieval systems, preference judgments have been demonstrated to be a superior alternative to graded or binary relevance judgments. In contrast to graded judgments, where each document is assigned to a pre-defined grade level, with preference judgments, assessors judge a pair of items presented side by side, indicating which is better. Unfortunately, preference judgments may require a larger number of judgments, even under an assumption of transitivity. Until recently they also lacked well-established evaluation measures. Previous studies have explored various evaluation measures and proposed different approaches to address the perceived shortcomings of preference judgments. These studies focused on crowdsourced preference judgments, where assessors may lack the training and time to make careful judgments. They did not consider the case where assessors have been trained and provided with the time to carefully consider differences between items. We review the literature in terms of algorithms and strategies for extracting preference judgment, evaluation metrics, interface design, and use of crowdsourcing. In this thesis, we design and build a new framework for preference judgment called JUDGO, with various components designed for expert reviewers and researchers. We also suggested a new heap-like preference judgment algorithm that assumes transitivity and tolerates ties. With the help of our framework, NIST assessors found the top-10 best items of each 38 topics for TREC 2022 Health Misinformation Track, with more than 2,200 judgments collected. Our analysis shows that assessors frequently use the search box feature, which enables them to highlight their own keywords in documents, but they are less interested in highlighting documents with the mouse. As a result of additional feedback, we make some modifications to the initially proposed algorithm method and highlighting features.

Description

Keywords

preference judgment, offline evaluation, information retrieval

LC Subject Headings

Citation