Some string problems in computational biology
Date
2000
Authors
Lanctôt, J. Kevin
Publisher
University of Waterloo
Abstract
This thesis introduces and analyzes a collection of string algorithms that are at the core of several biological problems.
First, it presents the Grammar Transform Analysis and Compression (GTAC) entropy estimator, the first entropy estimator for DNA sequences that has both proven properties and excellent entropy estimates. Additionally, the estimator uses a novel data structure to repeatedly solve the Longest Non-overlapping Pattern Problem in linear time. GTAC beats all known competitors in running time, in the low values of its entropy estimates, and in the number of properties that have been proven about it.
Second, it presents the Distinguishing String Problem, which has many biological applications such as creating diagnostic probes, universal primers, unbiased consensus sequences, and discovering potential drug targets. All these applications reduce to the task of finding a pattern that, with some error, occurs in one set of strings (the Closest String Problem and the Closest Substring Problem) and does not occur in another set (the Farthest String Problem and the Farthest Substring Problem). The NP-hardness and approximation properties of these problems are characterized, and approximation algorithms are presented.
Description
Keywords
Harvested from Collections Canada