|
Please use this identifier to cite or link to this item:
http://hdl.handle.net/10012/5750
|
| Title: | The Impact of Near-Duplicate Documents on Information Retrieval Evaluation |
| Authors: | Khoshdel Nikkhoo, Hani |
| Keywords: | near-duplicate detection; MapReduce; shingles |
| Approved Date: | 21-Jan-2011 |
| Date Submitted: | 18-Jan-2011 |
| Abstract: | Near-duplicate documents can adversely affect the efficiency and
effectiveness of search engines.
Due to the pairwise nature of the comparisons required for near-duplicate
detection, this process is extremely costly in terms of the time and
processing power it requires.
Despite the ubiquitous presence of near-duplicate detection algorithms
in commercial search engines, their application and impact in research
environments have not been fully explored.
The implementation of near-duplicate detection algorithms forces trade-offs
between efficiency and effectiveness, entailing careful testing and
measurement to ensure acceptable performance.
In this thesis, we describe and evaluate a scalable implementation of a
near-duplicate detection algorithm, based on standard shingling techniques,
running under a MapReduce framework.
We explore two different shingle sampling techniques and analyze
their impact on the near-duplicate document detection process.
In addition, we investigate the prevalence of near-duplicate documents
in the runs submitted to the ad hoc task of the TREC 2009 Web Track. |
| Program: | Computer Science |
| Department: | School of Computer Science |
| Degree: | Master of Mathematics |
| URI: | http://hdl.handle.net/10012/5750 |
| Appears in Collections: | Electronic Theses and Dissertations (UW); Faculty of Mathematics Theses and Dissertations
|
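The shingling approach named in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes word-level shingles of a hypothetical length `k`, compares two documents by the Jaccard similarity of their shingle sets, and runs in memory rather than under MapReduce. The function names `shingles` and `resemblance` are illustrative only, and the two shingle sampling techniques studied in the thesis are not shown because the abstract does not specify them.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word k-grams) in text.

    k=3 is an arbitrary illustrative choice, not a value from the thesis.
    """
    words = text.lower().split()
    if len(words) < k:
        # A document shorter than k words yields a single short shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=3):
    """Jaccard similarity of the shingle sets of documents a and b."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        # Two empty documents are treated as identical.
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

Two documents would then be flagged as near-duplicates when `resemblance` exceeds a chosen threshold. Comparing every pair this way is the pairwise cost the abstract refers to, which is what motivates sampling shingles and distributing the work.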
This item is protected by original copyright
|