Similarity and Diversity in Information Retrieval
MetadataShow full item record
Inter-document similarity is used for clustering, classification, and other purposes within information retrieval. In this thesis, we investigate several aspects of document similarity. In particular, we investigate the quality of several measures of inter-document similarity, providing a framework suitable for measuring and comparing the effectiveness of inter-document similarity measures. We also explore areas of research related to novelty and diversity in information retrieval. The goal of diversity and novelty is to be able to satisfy as many users as possible while simultaneously minimizing or eliminating duplicate and redundant information from search results. In order to evaluate the effectiveness of diversity-aware retrieval functions, user query logs and other information captured from user interactions with commercial search engines are mined and analyzed in order to uncover various informational aspects underlying queries, which are known as subtopics. We investigate the suitability of implicit associations between document content as an alternative to subtopic mining. We also explore subtopic mining from document anchor text and anchor links. In addition, we investigate the suitability of inter-document similarity as a measure for diversity-aware retrieval models, with the aim of using measured inter-document similarity as a replacement for diversity-aware evaluation models that rely on subtopic mining. Finally, we investigate the suitability and application of document similarity for requirements traceability. We present a fast algorithm that uncovers associations between various versions of frequently edited documents, even in the face of substantial changes.