Studying Relevance Judging Behavior of Secondary Assessors
Secondary assessors, individuals who do not originate search topics and are employed solely to judge the relevance of documents, have been found to differ in their relevance judgments. Their judgments are used to construct test collections, which play a significant role in evaluating search systems, and in e-discovery to help locate relevant material. To a large extent, our existing understanding of secondary assessors' judging behavior is limited to quantitative measurements. The goal of this thesis is to better understand the relevance judging behavior of secondary assessors. To this end, we conducted two user studies. The first study, which forms the main part of this thesis, was a think-aloud study and provides what may be the first qualitative study of secondary assessors' judging behavior. The second study aimed to capture the uncertainty in secondary assessors' relevance judgments. Using the data from the think-aloud study, we also examined further how secondary assessors behave when judging multiple types of documents. Data obtained through the think-aloud method gave us more in-depth insight into secondary assessors' relevance judging behavior, because we were able to listen directly to their thoughts and note them during the assigned search tasks. Based on this data, we found that relevance judgments are made with levels of certainty that range from low to high. We also found that factors relating to the search topic, the document, and the assessor can each contribute to differing judgments. The think-aloud study also provides preliminary evidence of how the amount of detail in a search topic's description influences the relevance judging behavior of secondary assessors.
In our second user study, we designed four user interfaces to capture the uncertainty in secondary assessors' relevance judgments. The objective was to study this uncertainty when the level of certainty is self-reported. We found that assessors tend to make high-certainty relevance judgments regardless of a document's consensus level. For high-consensus documents, assessors were less accurate when making low-certainty judgments; when making high-certainty judgments, they were more accurate and tended to agree with NIST assessors. For low-consensus documents, assessors' accuracy was low regardless of their certainty level. Finally, we found that assessors tend to spend less time on high-certainty relevance judgments, regardless of the consensus level of the document.
Further study of the behavior of secondary assessors when judging multiple types of documents showed that relevance judgments are occasionally based on incorrect perceptions. We show how factors such as lack of familiarity, lack of understanding of the search topic, absence of keywords, and other reasons can be a source not only of incorrect relevance judgments but also of correct ones. We also illustrate how the length of search topics and documents, and their level of difficulty, may further contribute to variation in the judgments.
Overall, our research contributes to a more extensive and meaningful understanding of the behavior of secondary assessors. It establishes a foundation for future work on the impact of uncertainty in secondary assessors' relevance judgments. Our findings also show that assessor training and background, search topics, and document length should all be considered and given additional attention in order to obtain more reliable results.