The Impact of Near-Duplicate Documents on Information Retrieval Evaluation

Khoshdel Nikkhoo, Hani

Files

Khoshdel_Nikkhoo_Hani.pdf (1.42 MB)

Date

2011-01-21T16:53:50Z

Authors

Khoshdel Nikkhoo, Hani

Publisher

University of Waterloo

Abstract

Near-duplicate documents can adversely affect the efficiency and effectiveness of search engines. Because near-duplicate detection requires pairwise comparisons between documents, it is extremely costly in both time and processing power. Although near-duplicate detection algorithms are ubiquitous in commercial search engines, their application and impact in research environments have not been fully explored. Implementing these algorithms forces trade-offs between efficiency and effectiveness, and it requires careful testing and measurement to ensure acceptable performance. In this thesis, we describe and evaluate a scalable implementation of a near-duplicate detection algorithm, based on standard shingling techniques, running under a MapReduce framework. We explore two different shingle sampling techniques and analyze their impact on the near-duplicate document detection process. In addition, we investigate the prevalence of near-duplicate documents in the runs submitted to the ad hoc task of the TREC 2009 Web Track.
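To make the shingling idea concrete, the following is a minimal single-machine sketch in Python. It is not the implementation evaluated in the thesis. The abstract does not name the two sampling techniques studied, so the two shown here (keeping the s smallest shingle hashes, and keeping hashes divisible by m) are assumptions, chosen because they are commonly described in the shingling literature. All function names, the shingle width, and the sample parameters are hypothetical, and the MapReduce distribution step is omitted.

```python
import hashlib


def shingles(text, w=3):
    """Return the set of w-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < w:
        # A document shorter than w words yields one shingle (or none if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}


def shingle_hash(shingle):
    """Map a shingle to a stable 64-bit integer (first 8 bytes of SHA-1)."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def jaccard(a, b):
    """Resemblance of two sets: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def sample_min_s(hashes, s=4):
    """Assumed technique 1: keep the s smallest hash values (fixed-size sketch)."""
    return set(sorted(hashes)[:s])


def sample_mod_m(hashes, m=2):
    """Assumed technique 2: keep hash values divisible by m (sketch grows with the document)."""
    return {h for h in hashes if h % m == 0}


def resemblance_min_s(sketch_a, sketch_b, s=4):
    """Estimate resemblance from two min-s sketches.

    Takes the s smallest values of the combined sketches and counts how many
    of them appear in both sketches.
    """
    union_sketch = set(sorted(sketch_a | sketch_b)[:s])
    if not union_sketch:
        return 1.0
    return len(union_sketch & sketch_a & sketch_b) / len(union_sketch)


if __name__ == "__main__":
    d1 = "the quick brown fox jumps over the lazy dog"
    d2 = "the quick brown fox leaps over the lazy dog"
    h1 = {shingle_hash(x) for x in shingles(d1)}
    h2 = {shingle_hash(x) for x in shingles(d2)}
    print("exact resemblance:", jaccard(h1, h2))
    print("min-s estimate:", resemblance_min_s(sample_min_s(h1), sample_min_s(h2)))
    print("mod-m estimate:", jaccard(sample_mod_m(h1), sample_mod_m(h2)))
```

Sampling trades accuracy for cost: a sketch is much smaller than the full shingle set, so each pairwise comparison is cheaper, but the resemblance it yields is only an estimate of the exact value.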