Enhancements to Hidden Markov Models for Gene Finding and Other Biological Applications

Vinar, Tomas

Files

tvinar2005.pdf (1.44 MB)

Date

2005

Authors

Vinar, Tomas

Publisher

University of Waterloo

Abstract

In this thesis, we present enhancements of hidden Markov models for the problem of finding genes in DNA sequences. Genes are the parts of DNA that serve as a template for synthesis of proteins. Thus, gene finding is a crucial step in the analysis of DNA sequencing data. Hidden Markov models are a key tool used in gene finding. This thesis presents three methods for extending the capabilities of hidden Markov models to better capture the statistical properties of DNA sequences. In all three, we encounter limiting factors that force a trade-off against model accuracy. First, we build better models for recognizing biological signals in DNA sequences. Our new models capture non-adjacent dependencies within these signals. In this case, the main limiting factor is the amount of training data: more training data allows more complex models. Second, we design methods for better representation of length distributions in hidden Markov models, where we balance the accuracy of the representation against the running time needed to find genes in novel sequences. Finally, we show that creating hidden Markov models with complex topologies may be detrimental to prediction accuracy unless we use more complex prediction algorithms. However, such algorithms require longer running times, and in many cases the prediction problem is NP-hard. For gene finding, this means that incorporating some of the prior biological knowledge into the model would require impractical running times. Nevertheless, we demonstrate that our methods can be used for solving other biological problems in which the input sequences are short. As a model example to evaluate our methods, we built a gene finder, ExonHunter, that outperforms programs commonly used in genome projects.