
dc.contributor.author: Kane, Andrew
dc.date.accessioned: 2014-11-20 20:50:57 (GMT)
dc.date.available: 2014-11-20 20:50:57 (GMT)
dc.date.issued: 2014-11-20
dc.date.submitted: 2014
dc.identifier.uri: http://hdl.handle.net/10012/8945
dc.description.abstract: This thesis examines space-time optimizations of in-memory search engines. Search engines can answer queries quickly, but this is accomplished using significant resources in the form of multiple machines running concurrently. Improving the performance of search engines means reducing the resource costs, such as hardware, energy, and cooling. These saved resources can then be used to improve the effectiveness of the search engine or to provide additional value to the system. We improve the space-time performance for search engines in the context of in-memory conjunctive intersection of ordered document identifier lists. We show that reordering of document identifiers can produce dense regions in these lists, where bitvectors can be used to improve the efficiency of conjunctive list intersection. Since the process of list intersection is a fundamental building block and a major performance bottleneck for search engines, this work will be important for all search engine researchers and developers. Our results are presented in three stages. First, we show how to combine multiple existing techniques for list intersection to improve space-time performance. We combine bitvectors for large lists with skips over compressed values for the other lists. When the skips are large and overlaid on the compressed lists, space-time performance is superior to existing techniques, such as using skips or bitvectors separately. Second, we show that grouping documents by size and ordering by URL within groups combines the skewed clustering that results from document size ordering with the tight clustering that results from URL ordering. We propose a new semi-bitvector data structure that encodes the front of a list, including groups with large documents, as a bitvector and the rest of the list as skips over compressed values. This combination produces significant space-time performance gains on top of the gains from the first stage.
Third, we show how partitioning by document size into separate indexes can also produce high-density regions that can be exploited by bitvectors, resulting in benefits similar to grouping by document size within one index. This partitioning technique requires no modification of the intersection algorithms, and it is therefore broadly applicable. We further show that any of our partitioning approaches can be combined with semi-bitvectors and grouping within each partition to effectively exploit skewed clustering and tight clustering in our dataset. A hierarchy of partitioning approaches may be required to exploit clustering in very large document collections. [en]
dc.language.iso: en [en]
dc.publisher: University of Waterloo [en]
dc.subject: Information Retrieval [en]
dc.subject: Algorithms [en]
dc.subject: Performance [en]
dc.subject: Efficiency [en]
dc.subject: Optimization [en]
dc.subject: Compression [en]
dc.subject: Intersection [en]
dc.subject: Distribution [en]
dc.subject: Partitioning [en]
dc.subject: Reordering [en]
dc.subject: Bitvectors [en]
dc.title: Integrating Skips and Bitvectors for List Intersection [en]
dc.type: Doctoral Thesis [en]
dc.pending: false
dc.subject.program: Computer Science [en]
uws-etd.degree.department: School of Computer Science [en]
uws-etd.degree: Doctor of Philosophy [en]
uws.typeOfResource: Text [en]
uws.peerReviewStatus: Unreviewed [en]
uws.scholarLevel: Graduate [en]
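
The first stage described in the abstract (bitvectors for large lists, skips over the remaining lists) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the postings here are left uncompressed, and the names `SKIP`, `make_bitvector`, `contains`, and `intersect` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

SKIP = 4  # hypothetical skip interval; the record does not state the sizes used


def make_bitvector(doc_ids, universe):
    """Encode a (large) list of document IDs as one bit per document."""
    bits = bytearray((universe + 7) // 8)
    for d in doc_ids:
        bits[d >> 3] |= 1 << (d & 7)
    return bits


def in_bitvector(bits, d):
    """Constant-time membership test against a bitvector."""
    return bool(bits[d >> 3] & (1 << (d & 7)))


def make_skips(postings):
    """Keep every SKIP-th document ID so whole blocks can be jumped over."""
    return postings[::SKIP]


def contains(postings, skips, d):
    """Locate the block that could hold d via the skips, then scan only it.

    Assumption: a real index would decompress the block at this point;
    this sketch stores plain integers instead.
    """
    block = bisect_right(skips, d) - 1
    if block < 0:
        return False
    start = block * SKIP
    return d in postings[start:start + SKIP]


def intersect(short_list, skip_lists, bitvectors):
    """Conjunctive intersection driven by the shortest list.

    skip_lists is a list of (postings, skips) pairs; bitvectors holds the
    large lists. Bitvectors are probed first because the test is cheapest.
    """
    result = []
    for d in short_list:
        if all(in_bitvector(bv, d) for bv in bitvectors) and all(
            contains(p, s, d) for p, s in skip_lists
        ):
            result.append(d)
    return result
```

For example, with a dense list of even document IDs held as a bitvector and two short ordered lists, `intersect` returns only the IDs present in all three. Probing the bitvector before touching the skip lists is a design choice of this sketch, made because a bit test avoids any block scan for candidates that the dense list already rules out.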


All items in UWSpace are protected by copyright, with all rights reserved.
