Integrating Skips and Bitvectors for List Intersection

Kane, Andrew

Files

Kane_Andrew.pdf (754.45 KB)

Date

2014-11-20

Authors

Kane, Andrew

Publisher

University of Waterloo

Abstract

This thesis examines space-time optimizations of in-memory search engines. Search engines can answer queries quickly, but this is accomplished using significant resources in the form of multiple machines running concurrently. Improving the performance of search engines means reducing resource costs, such as hardware, energy, and cooling. The saved resources can then be used to improve the effectiveness of the search engine or to provide additional value to the system. We improve the space-time performance of search engines in the context of in-memory conjunctive intersection of ordered document identifier lists. We show that reordering document identifiers can produce dense regions in these lists, where bitvectors can be used to improve the efficiency of conjunctive list intersection. Since list intersection is a fundamental building block and a major performance bottleneck for search engines, this work will be important for all search engine researchers and developers.
Our results are presented in three stages. First, we show how to combine multiple existing techniques for list intersection to improve space-time performance. We combine bitvectors for large lists with skips over compressed values for the other lists. When the skips are large and overlaid on the compressed lists, space-time performance is superior to existing techniques, such as using skips or bitvectors separately.
Second, we show that grouping documents by size and ordering by URL within groups combines the skewed clustering that results from document size ordering with the tight clustering that results from URL ordering. We propose a new semi-bitvector data structure that encodes the front of a list, including groups with large documents, as a bitvector and the rest of the list as skips over compressed values. This combination produces significant space-time performance gains on top of the gains from the first stage.
Third, we show how partitioning by document size into separate indexes can also produce high-density regions that can be exploited by bitvectors, resulting in benefits similar to grouping by document size within one index. This partitioning technique requires no modification of the intersection algorithms, and it is therefore broadly applicable. We further show that any of our partitioning approaches can be combined with semi-bitvectors and grouping within each partition to effectively exploit skewed clustering and tight clustering in our dataset. A hierarchy of partitioning approaches may be required to exploit clustering in very large document collections.
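
The semi-bitvector idea described in the abstract can be illustrated with a small Python sketch. This is not the thesis implementation. It assumes a single shared cutoff between the bitvector front and the tail of each list, uses a Python integer as the bitvector, and stores the tail in uncompressed fixed-size blocks that stand in for the compressed values. The names `SemiBitvector`, `BLOCK`, `cutoff`, `tail_contains`, and `intersect` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

BLOCK = 4  # hypothetical skip interval: number of postings per block


class SemiBitvector:
    """Doc IDs below `cutoff` live in a bitvector; the rest in blocks with skips."""

    def __init__(self, doc_ids, cutoff):
        self.cutoff = cutoff
        self.bits = 0  # Python int used as an arbitrary-length bitvector
        tail = []
        for d in sorted(set(doc_ids)):
            if d < cutoff:
                self.bits |= 1 << d
            else:
                tail.append(d)
        # Assumption: plain blocks stand in for compressed runs of doc IDs.
        self.blocks = [tail[i:i + BLOCK] for i in range(0, len(tail), BLOCK)]
        # One skip entry per block: the first doc ID stored in that block.
        self.skips = [block[0] for block in self.blocks]

    def tail_contains(self, d):
        # The skips select the only block that could hold d.
        i = bisect_right(self.skips, d) - 1
        return i >= 0 and d in self.blocks[i]


def intersect(a, b):
    """Conjunctive intersection of two lists that share the same cutoff."""
    assert a.cutoff == b.cutoff
    result = []
    # Dense front: a single bitwise AND, then read out the set bits in order.
    both = a.bits & b.bits
    while both:
        low = both & -both  # isolate the lowest set bit
        result.append(low.bit_length() - 1)
        both ^= low
    # Sparse tail: probe b through its skips for each doc ID in a's tail.
    for block in a.blocks:
        for d in block:
            if b.tail_contains(d):
                result.append(d)
    return result


a = SemiBitvector([1, 3, 5, 8, 20, 21, 30, 45, 60], cutoff=16)
b = SemiBitvector([3, 4, 8, 21, 22, 45, 61, 70], cutoff=16)
print(intersect(a, b))  # → [3, 8, 21, 45]
```

In this sketch the document identifiers are assumed to have been reordered already so that the dense region falls below the cutoff. A real index would decompress each tail block on demand rather than keep it as a Python list.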