Exploring functional annotation through genomic and metagenomic data mining
Functional profiling of genomes and metagenomes, as well as data mining for novel proteins, all rely on computational methods for functional annotation of protein sequences. Standard methods assign protein function based on detected homology to reference sequences, but often leave behind a significant fraction of hypothetical sequences ("dark matter") that cannot be annotated. To maximize our ability to extract new biological insights from newly sequenced genomes, it is critical to understand the advantages and limitations of homology-based annotation, and to explore alternative methods for inferring function. In this thesis, I performed a comprehensive exploration of computational protein annotation, with a focus on bacterial genomes and metagenomes. First, I applied homology-based methods to functionally annotate and analyze original datasets including newly sequenced Streptomyces strains, a wastewater metagenome, and microbial communities involved in vertebrate decomposition. These studies identified genes and functions of interest including cellulases, antibiotic resistance genes, and virulence factors. I then explored the limits of homology-based annotation by measuring annotation coverage, the fraction of annotated proteins in a proteome, across ~27,000 organisms in the microbial tree of life. This study demonstrated a wide range in annotation coverage across bacteria, from 2% to 86%. It also revealed that multiple factors, including taxonomy, genome size, and research bias, heavily influence the degree to which proteomes can be annotated. To gain biological insights into hypothetical proteins of unknown function, I analyzed 4,049 domains of unknown function (DUFs) from Pfam. Using phylogenomic, taxonomic, and metagenomic information, I detected statistical associations between domains and biological traits. 
Association-based methods uncovered environment, lineage, and/or pathogen associations in just under half of all DUFs and highlighted new families such as DUF4765 as intriguing virulence factor candidates. Finally, I constructed a database of "ORFan" metagenomic sequences that cannot be annotated using standard approaches, and inferred functions for tens of thousands of these sequences using profile-profile comparison approaches. Motif analysis and genomic context validated these predictions, enabling the discovery of hundreds of novel candidate metalloproteases. Protein "dark matter", which includes a large pool of unannotated coding sequences, is a rich resource for finding new proteins and functions of interest, and I include suggestions on how to prioritize these sequences for future study. A combination of homology-based and alternative annotation methods will be most effective for broad functional profiling of genomes and metagenomes, and can push the boundaries of functional interpretation of sequence data.
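The abstract defines annotation coverage as the fraction of annotated proteins in a proteome. As a minimal sketch of that definition only (not the thesis's actual pipeline), the Python snippet below computes it for a toy proteome. The function name `annotation_coverage`, the protein IDs, and the annotation labels are all hypothetical and chosen purely for illustration.

```python
def annotation_coverage(annotations):
    """Return the fraction of proteins that have at least one annotation.

    `annotations` maps each protein ID to a list of assigned annotation
    labels; an empty list means the protein is unannotated (hypothetical).
    """
    if not annotations:
        raise ValueError("proteome is empty")
    annotated = sum(1 for hits in annotations.values() if hits)
    return annotated / len(annotations)


# Hypothetical toy proteome: IDs and labels are placeholders, not real data.
toy_proteome = {
    "prot_001": ["family_A"],
    "prot_002": [],
    "prot_003": ["family_B", "family_C"],
    "prot_004": [],
}

coverage = annotation_coverage(toy_proteome)
print(f"Annotation coverage: {coverage:.0%}")  # prints "Annotation coverage: 50%"
```

Here two of the four proteins carry at least one label, so the coverage is 50%. What counts as "annotated" (for example, which databases and score thresholds are used) is a choice the sketch leaves open.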
Cite this version of the work
Briallen Lobb (2020). Exploring functional annotation through genomic and metagenomic data mining. UWSpace. http://hdl.handle.net/10012/16267