Upper and Lower Bounds for Text
Upper and Lower Bounds for Text Indexing Data Structures

Golynski, Alexander

Upper and Lower Bounds for Text Upper and Lower Bounds for Text Indexing Data Structures

Files

thesis.pdf (963.01 KB)

Date

2008-01-23T14:37:36Z

Authors

Golynski, Alexander

Publisher

University of Waterloo

Abstract

The main goal of this thesis is to investigate the complexity of a variety of problems related to text indexing and text searching. We present new data structures that can be used as building blocks for full-text indices which occupies minute space (FM-indexes) and wavelet trees. These data structures also can be used to represent labeled trees and posting lists. Labeled trees are applied in XML documents, and posting lists in search engines. The main emphasis of this thesis is on lower bounds for time-space tradeoffs for the following problems: the rank/select problem, the problem of representing a string of balanced parentheses, the text retrieval problem, the problem of computing a permutation and its inverse, and the problem of representing a binary relation. These results are divided in two groups: lower bounds in the cell probe model and lower bounds in the indexing model. The cell probe model is the most natural and widely accepted framework for studying data structures. In this model, we are concerned with the total space used by a data structure and the total number of accesses (probes) it performs to memory, while computation is free of charge. The indexing model imposes an additional restriction on the storage: the object in question must be stored in its raw form together with a small index that facilitates an efficient implementation of a given set of queries, e.g. finding rank, select, matching parenthesis, or an occurrence of a given pattern in a given text (for the text retrieval problem). We propose a new technique for proving lower bounds in the indexing model and use it to obtain lower bounds for the rank/select problem and the balanced parentheses problem. We also improve the existing techniques of Demaine and Lopez-Ortiz using compression and present stronger lower bounds for the text retrieval problem in the indexing model. The most important result of this thesis is a new technique for cell probe lower bounds. We demonstrate its strength by proving new lower bounds for the problem of representing permutations, the text retrieval problem, and the problem of representing binary relations. (Previously, there were no non-trivial results known for these problems.) In addition, we note that the lower bounds for the permutations problem and the binary relations problem are tight for a wide range of parameters, e.g. the running time of queries, the size and density of the relation.