Large-scale phylogenomic visualization and analysis of functional traits in bacteria
The growth of genomic information in public databases has dramatically improved our view of the tree of life and at the same time expanded our knowledge of protein diversity. Through the use of automated annotation pipelines, researchers can predict many of the functional capabilities of organisms directly from their genome sequence. Although numerous phylogenetic and protein databases exist, there have been few attempts to combine these data, which is essential for the study of protein evolution. The web application AnnoTree (annotree.uwaterloo.ca) was developed as part of this thesis to facilitate the exploration and visualization of protein families (Pfams) and KEGG orthologs (KOs) on a phylogeny composed of nearly 24,000 bacterial genomes. The visualization includes an interactive tree of life, a summary of the taxonomic distribution of the query, basic taxonomic information, and annotation confidence scores. All protein sequences, visualizations, and summary information can be downloaded directly from the interface. The AnnoTree framework is open-source and can be modified to incorporate any custom tree, taxonomy, and proteome dataset. AnnoTree allows users to visualize the phylogenetic distribution of a Pfam of interest, which, in combination with obtained gain/loss data, promotes hypothesis generation in the context of protein evolution. To identify functions that are more tightly associated with evolutionary mechanisms such as horizontal gene transfer and evolutionary conservation, the pre-computed annotation data were combined with the bacterial tree of life in a phylogenomics analysis. The phyletic patchiness of all Pfam and KO annotations was measured using the normalized consistency index (CI), a measure of disagreement between the presence/absence states of traits across the tree and the tree topology. Pfams and KOs with the highest normalized CI represent functions known to be associated with mobile genetic elements and viral defence.
These annotations were most commonly found within the genomes of symbiotic and pathogenic bacteria. The most highly conserved Pfams and KOs were functions related to core processes such as transcription, DNA replication, and protein synthesis as well as those required for oxygenic photosynthesis and sporulation. Lineage-specific Pfams and KOs were classified in many bacterial taxa, revealing many clade-defining functions in the Bacillus_A genus, the Oxyphotobacteria class, and the Actinobacteria class, among others. An additional phylogenomics analysis was performed to identify branches of a phylogeny encompassing representatives from all three domains of life undergoing the most Pfam gain and loss events. The branches dividing the three taxonomic domains had the highest density of gain events, all of which were associated with well-known clade-defining functions. Missing data influenced the frequency of Pfam losses in lower taxonomic levels, but some characterized genome streamlining events within Eukaryotes were uncovered. Ultimately, the development of AnnoTree and accompanying analyses provide new insights into large-scale bacterial phylogenomics and the evolution and distributions of bacterial protein domains and gene families.
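To make the consistency index mentioned in the abstract more concrete, the sketch below computes the standard (unnormalized) CI of a binary presence/absence trait on a small rooted binary tree, using Fitch parsimony to count state changes. It is only an illustration: the nested-tuple tree encoding, the function names, and the toy data are assumptions made here, and the normalization applied in the thesis is not described in this abstract and is not reproduced. In this standard form, a CI of 1 means the trait fits the tree with the minimum possible number of changes, and lower values mean a patchier distribution.

```python
# Illustrative sketch only: standard consistency index (CI) via Fitch parsimony.
# Tree encoding, names, and data are hypothetical, not taken from AnnoTree.

def fitch_steps(tree, states):
    """Return (candidate_state_set, parsimony_steps) for a rooted binary tree.

    A tree is either a leaf name (str) or a 2-tuple of subtrees.
    `states` maps each leaf name to its observed state (e.g. 0 or 1).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch_steps(tree[0], states)
    right_set, right_steps = fitch_steps(tree[1], states)
    steps = left_steps + right_steps
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed here.
        return common, steps
    # Children disagree: one change is required on a descending branch.
    return left_set | right_set, steps + 1


def consistency_index(tree, states):
    """CI = minimum conceivable changes / changes required on this tree.

    Assumes `states` contains exactly the leaves of `tree`.
    """
    min_changes = len(set(states.values())) - 1
    _, steps = fitch_steps(tree, states)
    if steps == 0:
        return 1.0  # invariant trait: trivially consistent
    return min_changes / steps


# Toy eight-genome tree (hypothetical).
TREE = ((("A", "B"), ("C", "D")), (("E", "F"), ("G", "H")))

# Trait present in exactly one clade: a single gain explains it.
CLADE_TRAIT = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 0, "F": 0, "G": 0, "H": 0}

# Trait scattered across the tree: several independent changes are needed.
PATCHY_TRAIT = {"A": 1, "B": 0, "C": 1, "D": 0, "E": 1, "F": 0, "G": 1, "H": 0}

print(consistency_index(TREE, CLADE_TRAIT))   # 1.0
print(consistency_index(TREE, PATCHY_TRAIT))  # 0.25
```

The clade-restricted trait needs one change and scores 1.0, while the scattered trait needs four changes and scores 0.25, which is the kind of contrast a patchiness analysis over many Pfams and KOs would rank.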
Cite this version of the work
Kerrin Mendler (2019). Large-scale phylogenomic visualization and analysis of functional traits in bacteria. UWSpace. http://hdl.handle.net/10012/14420