|Title: ||The Impact of Near-Duplicate Documents on Information Retrieval Evaluation|
|Authors: ||Khoshdel Nikkhoo, Hani|
|Keywords: ||near-duplicate detection|
|Approved Date: ||21-Jan-2011 |
|Date Submitted: ||18-Jan-2011 |
|Abstract: ||Near-duplicate documents can adversely affect the efficiency and
effectiveness of search engines.
Due to the pairwise nature of the comparisons required for near-duplicate
detection, this process is extremely costly in terms of the time and
processing power it requires.
Despite the ubiquitous presence of near-duplicate detection algorithms
in commercial search engines, their application and impact in research
environments have not been fully explored.
The implementation of near-duplicate detection algorithms forces trade-offs
between efficiency and effectiveness, entailing careful testing and
measurement to ensure acceptable performance.
In this thesis, we describe and evaluate a scalable implementation of a
near-duplicate detection algorithm, based on standard shingling techniques,
running under a MapReduce framework.
We explore two different shingle sampling techniques and analyze
their impact on the near-duplicate document detection process.
In addition, we investigate the prevalence of near-duplicate documents
in the runs submitted to the ad hoc task of the TREC 2009 Web Track.|
|Program: ||Computer Science|
|Department: ||School of Computer Science|
|Degree: ||Master of Mathematics|
|Appears in Collections:||Electronic Theses and Dissertations (UW)|
Faculty of Mathematics Theses and Dissertations
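The abstract refers to standard shingling techniques and to shingle sampling without giving details. As a rough illustration only, the sketch below builds word-level shingles, compares two documents by Jaccard resemblance, and keeps a fixed-size sample of shingle hashes. The function names, the shingle length k=3, the MD5 hash, and the "keep the s smallest hashes" sampling rule are all assumptions made for this example; they are not taken from the thesis, and no MapReduce machinery is shown.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short documents yield a single shingle (or none if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard resemblance |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def sample_min_hashes(shingle_set, s=4):
    """Hypothetical sampling rule: keep the s smallest shingle hashes."""
    hashes = sorted(
        int(hashlib.md5(sh.encode("utf-8")).hexdigest(), 16)
        for sh in shingle_set
    )
    return set(hashes[:s])


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"

# Resemblance over the full shingle sets.
full_resemblance = jaccard(shingles(doc_a), shingles(doc_b))

# Approximate resemblance using only the sampled hashes.
sampled_resemblance = jaccard(
    sample_min_hashes(shingles(doc_a)),
    sample_min_hashes(shingles(doc_b)),
)
```

Comparing every pair of documents this way is quadratic in the collection size, which is the cost the abstract describes. Sampling shrinks each document's representation, at the price of a noisier resemblance estimate.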