The Impact of Near-Duplicate Documents on Information Retrieval Evaluation
Date
2011-01-21T16:53:50Z
Authors
Khoshdel Nikkhoo, Hani
Publisher
University of Waterloo
Abstract
Near-duplicate documents can adversely affect the efficiency and
effectiveness of search engines.
Due to the pairwise nature of the comparisons required for near-duplicate
detection, this process is extremely costly in terms of the time and
processing power it requires.
Although near-duplicate detection algorithms are ubiquitous in commercial
search engines, their application and impact in research environments
have not been fully explored.
Implementing a near-duplicate detection algorithm forces trade-offs
between efficiency and effectiveness, so careful testing and measurement
are needed to ensure acceptable performance.
In this thesis, we describe and evaluate a scalable implementation of a
near-duplicate detection algorithm, based on standard shingling techniques,
running under a MapReduce framework.
We explore two different shingle sampling techniques and analyze
their impact on the near-duplicate document detection process.
In addition, we investigate the prevalence of near-duplicate documents
in the runs submitted to the ad hoc task of the TREC 2009 Web Track.
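
For readers unfamiliar with shingling, the following is a minimal single-machine sketch of the general idea, not the thesis implementation. It assumes word-level shingles, Jaccard resemblance as the similarity measure, and two commonly used sampling schemes (keeping the k smallest shingle hashes, and keeping hashes divisible by m). The abstract does not name the two sampling techniques it studies, so both schemes here, along with every function name and parameter, are illustrative assumptions. The MapReduce distribution is omitted.

```python
import hashlib


def shingles(text, w=3):
    """Return the set of w-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < w:
        # Document shorter than one window: treat it as a single shingle.
        return {" ".join(words)}
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}


def jaccard(a, b):
    """Jaccard resemblance |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def shingle_hash(shingle):
    """Stable 64-bit hash of a shingle (assumption: MD5 truncated to 8 bytes)."""
    digest = hashlib.md5(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sample_smallest(shingle_set, k):
    """Hypothetical scheme 1: keep the k smallest shingle hashes."""
    return set(sorted(shingle_hash(s) for s in shingle_set)[:k])


def sample_mod(shingle_set, m):
    """Hypothetical scheme 2: keep shingle hashes divisible by m."""
    return {h for h in map(shingle_hash, shingle_set) if h % m == 0}


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"

full_a, full_b = shingles(doc_a), shingles(doc_b)
print("resemblance on all shingles:", jaccard(full_a, full_b))

# Estimate the same resemblance from a fixed-size sample of each document.
print("resemblance on sampled hashes:",
      jaccard(sample_smallest(full_a, 4), sample_smallest(full_b, 4)))
```

Sampling shrinks the per-document sketch and therefore the cost of each comparison, at the price of a noisier resemblance estimate. That is one instance of the efficiency versus effectiveness trade-off the abstract mentions.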
Keywords
near-duplicate detection, MapReduce, shingles