The suffix-signature method for searching for phrases in text

Authors

Zhou, Mei

Publisher

University of Waterloo

Abstract

Finding all occurrences of a given word in a large static text is a well-studied problem. Most solutions, however, are not well suited to phrase searching. In this thesis, we investigate a new algorithm, based on the data structure known as a suffix array, for finding all occurrences of a given phrase in a large static text. Using this algorithm, phrases of bounded length can be found with an expected search cost of one disk access to the text and one disk access to an index. Achieving this performance for phrases of up to five words requires an index whose total size is approximately 120% of the size of the text. The algorithm guarantees a worst-case search performance of two disk accesses to the text per phrase search. The method augments a suffix array with a parallel signature array, so that every indexed phrase has an associated signature. To search for a phrase, we search a block of the index in memory to locate matching signatures. We then read the one or two phrases corresponding to matching signatures from disk and compare them with the target phrase to filter out false matches. We present theoretical properties of the data structure and algorithm, derived from a suitable model. The theoretical results have been validated through experiments with actual query patterns derived from logs of searches on the World Wide Web. These experiments show that the approach is applicable in practice to a variety of texts and realistic phrase searches.
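
The two-step search described in the abstract (match signatures in memory, then verify candidates against the text) can be illustrated with a minimal in-memory sketch. This is not the thesis's implementation: the signature function, the 8-bit signature width, the per-length signature layout, and the names `signature`, `build_index`, and `find_phrase` are all assumptions made for illustration. The block-structured index and the disk layout are not modeled; a linear scan stands in for the in-memory block search.

```python
import zlib

MAX_WORDS = 5   # assumed bound on phrase length (the abstract's example uses five words)
SIG_BITS = 8    # assumed signature width; kept small so false matches can occur


def signature(words):
    """Hash a word sequence down to SIG_BITS bits (hypothetical signature function)."""
    data = " ".join(words).encode("utf-8")
    return zlib.crc32(data) & ((1 << SIG_BITS) - 1)


def build_index(text):
    """Return (words, suffix_array, signature_array) for a static text."""
    words = text.lower().split()
    # Word-level suffix array: word positions sorted by the phrase starting there.
    suffix_array = sorted(range(len(words)), key=lambda i: words[i:i + MAX_WORDS])
    # Parallel array: for each entry, one signature per phrase length 1..MAX_WORDS
    # (an assumed layout, chosen only to keep the sketch simple).
    signature_array = [
        tuple(signature(words[pos:pos + n]) for n in range(1, MAX_WORDS + 1))
        for pos in suffix_array
    ]
    return words, suffix_array, signature_array


def find_phrase(phrase, words, suffix_array, signature_array):
    """Return the sorted word positions at which the phrase occurs."""
    target = phrase.lower().split()
    n = len(target)
    if not 1 <= n <= MAX_WORDS:
        raise ValueError("phrase length must be between 1 and MAX_WORDS words")
    sig = signature(target)
    hits = []
    for pos, sigs in zip(suffix_array, signature_array):
        # Step 1: cheap signature comparison against the index entry.
        if sigs[n - 1] != sig:
            continue
        # Step 2: read the candidate phrase from the text and drop false matches.
        if words[pos:pos + n] == target:
            hits.append(pos)
    return sorted(hits)


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog the quick red fox"
    index = build_index(sample)
    print(find_phrase("the quick", *index))
```

In this sketch the verification in step 2 plays the role of the disk reads in the abstract: a signature match only nominates a candidate, and the comparison with the actual text decides whether it is a true occurrence.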
