Advanced Machine Learning Techniques for Taxonomic Classification and Clustering of DNA Sequences

Date

2025-01-09

Advisor

Kari, Lila
Lu, Yang Young

Publisher

University of Waterloo

Abstract

Advances in genomic sequencing have rapidly increased the availability of DNA sequence data, presenting new opportunities and challenges in bioinformatics. This thesis addresses the need for scalable and precise computational tools for the taxonomic classification and clustering of DNA sequences through the application of advanced machine learning techniques. We introduce several novel algorithms designed to improve the accuracy and efficiency of these processes, covering both supervised and unsupervised machine learning approaches.

First, we introduce "DeLUCS", a deep learning-based method for the unsupervised clustering of DNA sequences. DeLUCS uses invariant information clustering to optimize the grouping of sequences without prior taxonomic information. We validate this model across multiple genomic datasets, including vertebrate mitochondrial genomes, bacterial genome segments, and viral sequences. DeLUCS significantly outperforms traditional methods such as K-Means and Gaussian Mixture Models, establishing its effectiveness for analyzing large, unlabelled DNA datasets.

Second, we explore the taxonomic classification of emerging astroviruses using a hybrid machine learning approach that combines supervised and unsupervised techniques. Our methodology integrates k-mer composition analysis of whole genomes with host species data, which is crucial for accurately identifying and classifying novel and as-yet unclassified astroviruses. This approach addresses the challenges posed by genetic recombination and broad interspecies transmission, which traditional host-based classifications fail to accommodate. By applying our method, we proposed genus labels for 191 previously unclassified astrovirus genomes and identified potential cross-species infections, demonstrating the need for a revised understanding of astrovirus taxonomy.
Additionally, we present "CGRclust", a novel method that applies twin contrastive learning with convolutional neural networks (CNNs) to the clustering of Chaos Game Representations of DNA sequences. CGRclust combines a dedicated data augmentation strategy with its model architecture, and improves upon traditional sequence classification by avoiding the need for DNA sequence alignment or taxonomic labels. This approach demonstrates robust performance, outperforming existing methods such as DeLUCS and MeShClust v3.0 in clustering accuracy on diverse datasets, including mitochondrial genomes and viral sequences.

Our comprehensive evaluations illustrate that these methods significantly advance the accuracy and computational efficiency of genomic data analysis. Specifically, our novel algorithms, including DeLUCS and CGRclust, use deep learning and contrastive learning models to refine the classification of DNA sequences without depending on prior taxonomic knowledge or sequence alignment. These improvements demonstrate how machine learning can transform the field of genomic analysis, setting a new standard for taxonomic classification and providing a foundation for future explorations in evolutionary biology and biodiversity.
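
For readers unfamiliar with the Chaos Game Representation (CGR) mentioned in the abstract, the following is a minimal illustrative sketch, not code from the thesis. It computes a frequency CGR: the count of CGR points falling in each cell of a 2^k x 2^k grid, which is equivalent to counting the k-mers of the sequence. The corner assignment for A, C, G and T, the function names `fcgr` and `normalize`, and the choice to restart the walk at non-ACGT characters are assumptions made for illustration only.

```python
from typing import List

# Assumed corner convention: each base is assigned a corner (x, y) of the unit square.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(sequence: str, k: int) -> List[List[int]]:
    """Count CGR points per cell of a 2^k x 2^k grid (grid[row][col], row = y, col = x).

    In the chaos game, each new point is the midpoint between the previous
    point and the corner of the current base. The top k binary digits of a
    point's coordinates are therefore determined by the last k bases, so the
    cell index can be tracked exactly with integer shifts.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    size = 1 << k
    grid = [[0] * size for _ in range(size)]
    col = row = 0
    run = 0  # number of consecutive valid bases seen so far
    for base in sequence.upper():
        corner = CORNERS.get(base)
        if corner is None:
            # Assumption: ambiguous symbols (e.g. N) restart the walk.
            col = row = run = 0
            continue
        # Halving the coordinate shifts its bits right; the new base
        # contributes the most significant bit.
        col = (col >> 1) | (corner[0] << (k - 1))
        row = (row >> 1) | (corner[1] << (k - 1))
        run += 1
        if run >= k:  # the cell is fully determined once k bases are seen
            grid[row][col] += 1
    return grid


def normalize(grid: List[List[int]]) -> List[List[float]]:
    """Scale counts to relative frequencies so sequences of different lengths are comparable."""
    total = sum(sum(r) for r in grid)
    return [[(c / total if total else 0.0) for c in r] for r in grid]
```

Under these assumptions, `normalize(fcgr(genome, 6))` yields a 64 x 64 matrix that can be treated as a single-channel image or flattened into a k-mer frequency vector, which is the kind of alignment-free input that clustering methods such as those described above operate on.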

Keywords

Machine Learning, Genomics, DNA Clustering, NATURAL SCIENCES::Chemistry::Theoretical chemistry::Bioinformatics, DNA Classification
