From genomes to metagenomes: Development of a rapid-aligner for genome assembly and application of macroecological models to microbiology

Hilts, Angus

Files

Hilts_Angus.pdf (4.02 MB)

Date

2019-09-26

Authors

Hilts, Angus

Advisor

Hug, Laura

Publisher

University of Waterloo

Abstract

Since the development of the modern computer, many scientific fields have undergone paradigm shifts driven by the growing ease of data collection and analysis. Microbiology has been impacted by computational advances, especially in DNA sequencing applications, and this has led to an interesting problem: there is too much raw data for any person to understand. It is important to have tools that are able to process and analyze these vast amounts of data, so that microbiologists can robustly test hypotheses and predict patterns.
Long-read sequencers are capable of sequencing entire genomes with very few reads, but exhibit much higher error rates than short-read sequencing platforms. Most current genome assemblers were developed for highly accurate short-read data, so there is a need for new tools that can handle these long, error-prone reads. Here, we developed an alignment algorithm in the C programming language for error-prone long reads, as part of a larger genome assembler. This alignment algorithm creates a profile of ordered kmers representing all of the reads, then clusters these kmers to generate a consensus sequence. We show that the alignment algorithm can handle long-read error rates and produce useful results. Using a low-coverage test data set, the algorithm produced a consensus sequence with 85.3% identity to a reference sequence built with extremely high coverage data. Future work will aim to improve this accuracy by error correcting kmers and identifying close repeats of kmers.
The field of metagenomics is entering a new stage of maturity. Isolation of total community DNA, shotgun sequencing, and assembly of draft genomes for populations have become standard practice in many microbial ecology labs, and many pipelines for manipulating metagenomic sequence data exist. What is not as well understood, however, is how to analyze the growing databases of metagenomic datasets with statistical rigour. Examining the relationships and interactions of different groups of microorganisms across the planet requires strong statistical models that can be used to assess hypotheses. We borrowed occupancy modelling from the macroecological toolbox and adapted it to microbial metagenomic datasets. Occupancy models are designed to assess the occupancy states of sample sites, while accounting for possible missed detections by re-sampling these sites. We emulate re-sampling by searching for multiple genes associated with functions of interest, where each gene is considered an independent sampling event. We use detection of these genes as a proxy for the presence of functional potential within environments, and can assess occurrence and, importantly, co-occurrence patterns.
We applied this method to nearly 10,000 metagenomes to assess global occupancy patterns for methanogens and methanotrophs, key contributors to the methane cycle. To assess the occupancy patterns of methane cyclers, we looked for genes encoding the subunits of the methyl coenzyme M reductase complex (MCR) and the methane monooxygenases (MMO), biological markers of methanogenesis and methanotrophy, respectively. Our models predicted that occupancy probabilities for both functional groups changed with ecosystem type, latitude, and the date that the data were deposited to the database. The explanatory power of the models was relatively low, which is likely due to a lack of metadata that could be used to better inform the models. Occupancy models have the potential to be powerful tools, but microbial ecologists will need to embrace better standards for metadata collection and reporting for metagenomes. Such metadata could include pH, temperature, and other key environmental factors. Future work should focus on establishing and enforcing these metadata requirements to enable statistical assessment of functionally important groups across environments.
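As a rough illustration of the occupancy-modelling idea described in the abstract, the sketch below evaluates the standard single-season occupancy likelihood, treating each marker gene as one "survey" of a metagenome. It is not the thesis's implementation: it assumes a single occupancy probability `psi` and a single per-gene detection probability `p` with no covariates (the thesis models included ecosystem type, latitude, and deposition date), and the detection histories, function names, and grid search are all hypothetical.

```python
import math


def site_likelihood(history, psi, p):
    """Likelihood of one site's detection history (list of 0/1 values).

    psi: probability the site is occupied (assumed constant across sites).
    p:   probability of detecting the function with any one gene, given
         occupancy (assumed constant across genes).
    """
    detections = sum(history)
    misses = len(history) - detections
    # Site occupied, and this exact pattern of hits and misses was observed.
    occupied = psi * (p ** detections) * ((1 - p) ** misses)
    if detections > 0:
        return occupied
    # No detections: the site is either occupied but missed every time,
    # or genuinely unoccupied.
    return occupied + (1 - psi)


def log_likelihood(histories, psi, p):
    """Sum of log-likelihoods over all sites."""
    return sum(math.log(site_likelihood(h, psi, p)) for h in histories)


# Hypothetical data: one row per metagenome, one column per marker gene
# (for example, three subunit genes of one complex); 1 = gene detected.
histories = [
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
]

# Crude maximum-likelihood estimate by grid search over (psi, p).
grid = [i / 100 for i in range(1, 100)]
best_psi, best_p = max(
    ((psi, p) for psi in grid for p in grid),
    key=lambda pair: log_likelihood(histories, pair[0], pair[1]),
)
print(f"psi estimate: {best_psi:.2f}, p estimate: {best_p:.2f}")
```

The all-zero rows are what distinguish an occupancy model from a naive presence/absence tally: they contribute through both the "occupied but missed" term and the "unoccupied" term rather than being counted as absences. In practice a numerical optimizer and covariate-dependent `psi` and `p` would replace the grid search.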

URI

http://hdl.handle.net/10012/15172

Collections

Theses
Biology
