Phan Minh, Linh Nhi
2023-04-12
2023-04-05
http://hdl.handle.net/10012/19272

Preference judging has been proposed as an effective method for identifying the most relevant documents for a given search query. In this thesis, we investigate the degree to which assessors using a preference judging system consistently find the same top documents, and how consistent they are in their own preferences. We also examine the extent to which variability in assessor preferences affects the evaluation of information retrieval systems. We designed and conducted a user study in which 40 participants were recruited to preference judge 30 topics taken from the 2021 TREC Health Misinformation track. The study found that the number of judgments needed to find the top-10 preferred documents using preference judging is about twice the number of documents in that topic. It also suggests that relying on a single non-professional assessor for preference judging is not sufficient for evaluating information retrieval systems. Additionally, the study showed that using preference judging to find the top-10 documents significantly changes the rankings of runs compared to the rankings reported in the TREC 2021 Health Misinformation track, with most changes occurring among the lower-ranked runs rather than the top-ranked runs. Overall, this thesis provides insights into assessor behaviour and assessor agreement when preference judgments are used to evaluate information retrieval systems.

en
An Investigation of Preference Judging Consistency
Master Thesis