The University of Waterloo Libraries will be performing maintenance on UWSpace tomorrow, November 5th, 2025, from 10 am – 6 pm EST.
UWSpace will be offline for all UW community members during this time. Please avoid submitting items to UWSpace until November 7th, 2025.

The Impact of Near-Duplicate Documents on Information Retrieval Evaluation

Loading...
Thumbnail Image

Authors

Khoshdel Nikkhoo, Hani

Advisor

Journal Title

Journal ISSN

Volume Title

Publisher

University of Waterloo

Abstract

Near-duplicate documents can adversely affect the efficiency and effectiveness of search engines. Due to the pairwise nature of the comparisons required for near-duplicate detection, this process is extremely costly in terms of the time and processing power it requires. Despite the ubiquitous presence of near-duplicate detection algorithms in commercial search engines, their application and impact in research environments is not fully explored. The implementation of near-duplicate detection algorithms forces trade-offs between efficiency and effectiveness, entailing careful testing and measurement to ensure acceptable performance. In this thesis, we describe and evaluate a scalable implementation of a near-duplicate detection algorithm, based on standard shingling techniques, running under a MapReduce framework. We explore two different shingle sampling techniques and analyze their impact on the near-duplicate document detection process. In addition, we investigate the prevalence of near-duplicate documents in the runs submitted to the adhoc task of TREC 2009 web track.

Description

LC Subject Headings

Citation