Fast Algorithms for Large-Scale Phylogenetic Reconstruction

Truszkowski, Jakub

Date

2013-10-24T18:02:25Z

Publisher

University of Waterloo

Abstract

One of the most fundamental computational problems in biology is that of inferring the evolutionary histories of groups of species from sequence data. Such evolutionary histories, known as phylogenies, are usually represented as binary trees in which leaves represent extant species and internal nodes represent their shared ancestors. As the amount of sequence data available to biologists increases, very fast phylogenetic reconstruction algorithms are becoming necessary. Large sequence alignments can now contain up to hundreds of thousands of sequences, making traditional methods, such as Neighbour Joining, computationally prohibitive. To address this problem, we have developed three novel fast phylogenetic algorithms. The first algorithm, QTree, is a quartet-based heuristic that runs in O(n log n) time. It is based on a theoretical algorithm that reconstructs the correct tree with high probability, assuming every quartet is inferred correctly with constant probability. The core of our algorithm is a balanced search tree structure that enables us to locate an edge in the tree in O(log n) time. Our algorithm is several times faster than all current methods, while its accuracy approaches that of Neighbour Joining. The second algorithm, LSHTree, is the first sub-quadratic time algorithm with theoretical performance guarantees under a Markov model of sequence evolution. It runs in O(n^{1+γ(g)} log^2 n) time, where g is an upper bound on the mutation rate along any branch in the phylogeny, γ is an increasing function of g, and γ(g) < 1 for all g. For phylogenies with very short branches, the running time of our algorithm is close to linear. In experiments, our prototype implementation was more accurate than the current fast algorithms, while being comparably fast. In the final part of this thesis, we apply the algorithmic framework behind LSHTree to the problem of placing large numbers of short sequence reads onto a fixed phylogenetic tree. Our initial results in this area are promising, but there are still many challenges to be resolved.
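
To illustrate what "inferring a quartet" can mean in practice, the following minimal sketch applies the classical four-point test to four aligned sequences. It is not the quartet method used in the thesis, which the abstract does not specify. The function names (hamming_fraction, infer_quartet) and the use of the raw mismatch fraction as a distance are assumptions made only for this example.

```python
def hamming_fraction(a: str, b: str) -> float:
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def infer_quartet(seqs: dict, a: str, b: str, c: str, d: str):
    """Return the pairing ((x, y), (z, w)) favored by the four-point test.

    Of the three possible unrooted topologies on four taxa, the test picks
    the one whose two within-pair distances have the smallest sum.
    (Hypothetical helper; raw mismatch fraction is an assumed distance.)
    """
    def dist(x, y):
        return hamming_fraction(seqs[x], seqs[y])

    candidates = [
        ((a, b), (c, d)),
        ((a, c), (b, d)),
        ((a, d), (b, c)),
    ]
    return min(candidates, key=lambda p: dist(*p[0]) + dist(*p[1]))
```

A quartet-based method queries a test of this kind many times. Because each answer can be wrong on noisy data, the theoretical algorithm described above is designed to tolerate quartets that are correct only with constant probability.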