Evaluating Information Retrieval Systems With Multiple Non-Expert Assessors
Many current test collections require the use of expert judgments during construction. The true label of each document is given by an expert assessor. However, the cost and effort associated with expert training and judging are typically quite high in the event where we have a high number of documents to judge. One way to address this issue is to have each document judged by multiple non-expert assessors at a lower expense. However, there are two key factors that can make this method difficult: the variability across assessors' judging abilities, and the aggregation of the noisy labels into one single consensus label. Much previous work has shown how to utilize this method to replace expert labels in the relevance evaluation. However, the effects of relevance judgment errors on the ranking system evaluation have been less explored. This thesis mainly investigates how to best evaluate information retrieval systems with noisy labels, where no ground-truth labels are provided, and where each document may receive multiple noisy labels. Based on our simulation results on two datasets, we find that conservative assessors that tend to label incoming documents as non-relevant are preferable. And there are two important factors affect the overall conservativeness of the consensus labels: the assessor's conservativeness and the relevance standard. This important observation essentially provides a guideline on what kind of consensus algorithms or assessors are needed in order to preserve the high correlation with expert labels in ranking system evaluation. Also, we systematically investigate how to find the consensus labels for those documents with equal confidence to be either relevant or non-relevant. We investigate a content-based consensus algorithm which links the noisy labels with document content. We compare it against the state-of-art consensus algorithms, and find that, depending on the document collection, this content-based approach may help or hurt the performance.