The suffix-signature method for searching for phrases in text

Author: Zhou, Mei
Dates: 2006-07-28; 1997
URI: http://hdl.handle.net/10012/213
Type: Doctoral Thesis
Format: application/pdf, 9827253 bytes
Language: en
Rights: Copyright: 1997, Zhou, Mei. All rights reserved.
Source: Harvested from Collections Canada

Finding all occurrences of a given word in a large static text is a well-studied problem. Most solutions, however, are not well-suited for phrase-searching. In this thesis, we investigate a new algorithm to find all occurrences of a given phrase in a large, static text, based on the data structure known as a suffix array. Using this algorithm, phrases of bounded length can be found with expected search time of one disk access to the text and one disk access to an index. To achieve this performance for phrases of up to five words in length requires an index having total size of approximately 120% of the size of the text. The algorithm guarantees a worst case search performance of 2 disk accesses to the text per phrase search.

The method augments a suffix array with a parallel signature array, so that every indexed phrase has an associated signature. To search for a phrase, we search a block of the index in memory to locate matching signatures. Then we read one or two phrases corresponding to matching signatures from disk and compare them to the target phrase to filter out false matches.

We present theoretical properties of the data structure and algorithm derived from a suitable model. The theoretical results have been validated through experimentation with actual query patterns derived from logs of searches on the World Wide Web. These experiments show that the approach is applicable in practice to a variety of texts and realistic phrase searches.
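
The idea in the abstract of pairing a suffix array with a parallel signature array can be illustrated with a small in-memory sketch. This is not the thesis's implementation: the signature layout (one 8-bit hash per word, for up to K = 5 words), the names `word_sig`, `build_index`, and `search`, and the linear scan over signatures are all assumptions made for illustration. The sketch also verifies every candidate against the text, whereas the abstract describes reading only one or two phrases from disk.

```python
import zlib

K = 5  # maximum indexed phrase length in words (assumed, from the abstract's example)


def word_sig(word):
    # Hypothetical per-word signature: low 8 bits of a CRC.
    # Collisions are expected; they are filtered out by checking the text.
    return zlib.crc32(word.encode("utf-8")) & 0xFF


def build_index(words):
    # Word-level suffix array: word positions sorted by the suffix starting there.
    suffix_array = sorted(range(len(words)), key=lambda i: words[i:])
    # Parallel signature array: one signature per suffix-array entry,
    # covering up to the first K words of that suffix.
    signatures = [tuple(word_sig(w) for w in words[i:i + K]) for i in suffix_array]
    return suffix_array, signatures


def search(words, suffix_array, signatures, phrase):
    q = len(phrase)
    if not 0 < q <= K:
        raise ValueError("phrase length must be between 1 and K words")
    target = tuple(word_sig(w) for w in phrase)
    hits = []
    for pos, sig in zip(suffix_array, signatures):
        # Step 1: compare signatures only (the in-memory index search).
        if sig[:q] == target:
            # Step 2: compare against the text itself to reject false matches
            # (this stands in for the disk access to the text).
            if words[pos:pos + q] == phrase:
                hits.append(pos)
    return sorted(hits)
```

For example, after indexing the words of "the quick brown fox jumps over the quick red fox", `search(words, suffix_array, signatures, ["the", "quick"])` returns the word positions `[0, 6]`. Splitting the signature per word is what lets a query shorter than K words match on a signature prefix; a real index would presumably pack these bits more tightly and search a block of signatures rather than scanning all of them.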