The Impact of Near-Duplicate Documents on Information Retrieval Evaluation

dc.contributor.author: Khoshdel Nikkhoo, Hani
dc.date.accessioned: 2011-01-21T16:53:50Z
dc.date.available: 2011-01-21T16:53:50Z
dc.date.issued: 2011-01-21T16:53:50Z
dc.date.submitted: 2011-01-18
dc.description.abstract: Near-duplicate documents can adversely affect the efficiency and effectiveness of search engines. Because near-duplicate detection requires pairwise comparisons between documents, it is extremely costly in terms of time and processing power. Although near-duplicate detection algorithms are ubiquitous in commercial search engines, their application and impact in research environments have not been fully explored. Implementing near-duplicate detection forces trade-offs between efficiency and effectiveness, which calls for careful testing and measurement to ensure acceptable performance. In this thesis, we describe and evaluate a scalable implementation of a near-duplicate detection algorithm, based on standard shingling techniques, running under a MapReduce framework. We explore two different shingle sampling techniques and analyze their impact on the near-duplicate document detection process. In addition, we investigate the prevalence of near-duplicate documents in the runs submitted to the ad hoc task of the TREC 2009 Web Track. [en]
dc.identifier.uri: http://hdl.handle.net/10012/5750
dc.language.iso: en [en]
dc.pending: false [en]
dc.publisher: University of Waterloo [en]
dc.subject: near-duplicate detection [en]
dc.subject: MapReduce [en]
dc.subject: shingles [en]
dc.subject.program: Computer Science [en]
dc.title: The Impact of Near-Duplicate Documents on Information Retrieval Evaluation [en]
dc.type: Master Thesis [en]
uws-etd.degree: Master of Mathematics [en]
uws-etd.degree.department: School of Computer Science [en]
uws.peerReviewStatus: Unreviewed [en]
uws.scholarLevel: Graduate [en]
uws.typeOfResource: Text [en]
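
The abstract refers to standard shingling techniques and to shingle sampling. As an illustration only, the sketch below shows the general idea in Python: documents are broken into word k-shingles, compared by Jaccard resemblance, and optionally reduced to a small sample of shingles. Everything here is an assumption made for illustration and is not taken from the thesis: the function names, the shingle size k=3, the sample size s=4, the use of MD5 as the hash, and the "keep the s smallest hashes" sampling rule, which is one common scheme and not necessarily either of the two techniques the thesis evaluates. The sketch also runs on a single machine and does not attempt to show the MapReduce implementation.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short documents yield a single shingle (or none if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard resemblance of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def sample_smallest(shingle_set, s=4):
    """Hypothetical sampling rule: keep the s shingles with the smallest hashes."""
    def h(shingle):
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return set(sorted(shingle_set, key=h)[:s])


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"

# Resemblance over all shingles.
full_resemblance = jaccard(shingles(doc_a), shingles(doc_b))

# Resemblance estimated from sampled shingles only (cheaper, less exact).
sampled_resemblance = jaccard(
    sample_smallest(shingles(doc_a)),
    sample_smallest(shingles(doc_b)),
)
```

The two example documents differ only in their last word, so they share six of eight distinct shingles. Sampling trades accuracy for a smaller comparison cost, which is the kind of efficiency and effectiveness trade-off the abstract describes.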

Files

Original bundle

Name: Khoshdel_Nikkhoo_Hani.pdf
Size: 1.42 MB
Format: Adobe Portable Document Format
