Novelty and Diversity in Retrieval Evaluation
Queries submitted to search engines rarely provide a complete and precise description of a user's information need. Most queries are ambiguous to some extent, having multiple interpretations. For example, the seemingly unambiguous query "tennis lessons" might be submitted by a user interested in attending classes in her neighborhood, seeking lessons for her child, looking for online video lessons, or planning to start a business teaching tennis. Search engines face the challenging task of satisfying different groups of users who have diverse information needs associated with a given query. One solution is to optimize ranking functions to satisfy diverse sets of information needs. Unfortunately, existing evaluation frameworks do not support such optimization. Instead, ranking functions are rewarded for satisfying the most likely intent associated with a given query. In this thesis, we propose a framework and associated evaluation metrics that support optimizing ranking functions to satisfy diverse information needs. Our proposed measures explicitly reward ranking functions that present the user with information that is novel with respect to previously viewed documents. Our measures reflect the quality of a ranking function by taking into account its ability to satisfy the diverse users submitting a query. Moreover, identifying and establishing test frameworks to compare ranking functions at web scale can be tedious. One reason is the dynamic nature of the web, where documents are constantly added and updated, making it necessary for search engine developers to seek additional human assessments. Along with issues of novelty and diversity, we explore an approximate approach for comparing ranking functions when complete human assessments are not available.
We demonstrate that our approach is capable of accurately sorting ranking functions based on their capability of satisfying diverse users, even in the face of incomplete human assessments.
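To make the idea of rewarding novelty concrete, the following is a minimal sketch and not the thesis's actual measures, which the abstract does not define. It assumes each document is labeled with the query intents it satisfies, and that each repeated coverage of an intent is worth a factor of (1 - alpha) less than the previous one. The names `novelty_gains`, `discounted_score`, `intents_by_doc`, and `alpha`, along with the log-based rank discount, are illustrative assumptions.

```python
import math
from typing import Dict, List, Set


def novelty_gains(ranking: List[str],
                  intents_by_doc: Dict[str, Set[str]],
                  alpha: float = 0.5) -> List[float]:
    """Per-rank gain; an intent already covered k times contributes (1 - alpha) ** k."""
    times_seen: Dict[str, int] = {}  # how often each intent has appeared so far
    gains: List[float] = []
    for doc in ranking:
        intents = intents_by_doc.get(doc, set())
        # Gain is computed against what the user has already viewed.
        gains.append(sum((1.0 - alpha) ** times_seen.get(i, 0) for i in intents))
        for i in intents:
            times_seen[i] = times_seen.get(i, 0) + 1
    return gains


def discounted_score(gains: List[float]) -> float:
    """Sum gains with a log2 rank discount (an assumed, DCG-style discount)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


# Hypothetical judgments for the query "tennis lessons".
intents_by_doc = {
    "d1": {"local classes"},
    "d2": {"local classes"},
    "d3": {"online videos"},
}
redundant = novelty_gains(["d1", "d2", "d3"], intents_by_doc)
diverse = novelty_gains(["d1", "d3", "d2"], intents_by_doc)
```

Under these assumptions, the ranking that surfaces a second intent earlier receives the higher score, while a conventional measure that only counts relevance to the most likely intent would not distinguish the two orderings in this way.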