Advanced Machine Learning Techniques for Taxonomic Classification and Clustering of DNA Sequences

dc.contributor.advisor: Kari, Lila
dc.contributor.advisor: Lu, Yang Young
dc.contributor.author: Alipour, Fatemeh
dc.date.accessioned: 2025-01-09T13:41:01Z
dc.date.available: 2025-01-09T13:41:01Z
dc.date.issued: 2025-01-09
dc.date.submitted: 2025-01-03
dc.description.abstract: Advancements in genomic sequencing have exponentially increased the availability of DNA sequence data, presenting new opportunities and challenges in bioinformatics. This thesis addresses the critical need for scalable and precise computational tools to enhance taxonomic classification and clustering of DNA sequences through the application of advanced machine learning techniques. We introduce several novel algorithms designed to improve the accuracy and efficiency of these processes, focusing on both supervised and unsupervised machine learning approaches. First, we introduce "DeLUCS", a deep learning-based method for the unsupervised clustering of DNA sequences. DeLUCS uses invariant information clustering to optimize the grouping of sequences without prior taxonomic information. We validate this model across multiple genomic datasets, including vertebrate mitochondrial genomes, bacterial genome segments, and viral sequences. DeLUCS significantly outperforms traditional methods such as K-Means and Gaussian Mixture Models, establishing its effectiveness for analyzing large, unlabelled DNA datasets. Second, we explore the taxonomic classification of emerging astroviruses using a hybrid machine learning approach that combines supervised and unsupervised techniques. Our methodology integrates k-mer composition analysis of whole genomes with host species data, which is crucial for accurately identifying and classifying novel and as-yet unclassified astroviruses. This approach addresses the challenges posed by genetic recombination and broad interspecies transmission, which traditional host-based classifications fail to accommodate. By applying our method, we propose genus labels for 191 previously unclassified astrovirus genomes and identify potential cross-species infections, demonstrating the need for a revised understanding of astrovirus taxonomy.
Additionally, we present "CGRclust", a novel method employing twin contrastive learning with convolutional neural networks (CNNs) for clustering Chaos Game Representations of DNA sequences. CGRclust leverages a unique data augmentation strategy and an advanced model architecture, and it improves upon traditional sequence classification by avoiding the need for DNA sequence alignment or taxonomic labels. This approach demonstrates robust performance, outperforming existing methods such as DeLUCS and MeShClust v3.0 in clustering accuracy on diverse datasets, including mitochondrial genomes and viral sequences. Our comprehensive evaluations illustrate that these methods significantly advance the accuracy and computational efficiency of genomic data analysis. Specifically, our novel algorithms, including DeLUCS and CGRclust, leverage deep learning and contrastive learning models to refine the classification of DNA sequences without depending on prior taxonomic knowledge or sequence alignment. These improvements demonstrate how machine learning can transform the field of genomic analysis, setting a new standard for taxonomic classification and providing a foundation for future explorations in evolutionary biology and biodiversity.
dc.identifier.uri: https://hdl.handle.net/10012/21323
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo [en]
dc.relation.uri: https://github.com/fatemehalipour/CGRclust
dc.relation.uri: https://github.com/fatemehalipour/3PCM
dc.relation.uri: https://github.com/millanp95/DeLUCS
dc.subject: Machine Learning
dc.subject: Genomics
dc.subject: DNA Clustering
dc.subject: NATURAL SCIENCES::Chemistry::Theoretical chemistry::Bioinformatics
dc.subject: DNA Classification
dc.title: Advanced Machine Learning Techniques for Taxonomic Classification and Clustering of DNA Sequences
dc.type: Doctoral Thesis
uws-etd.degree: Doctor of Philosophy
uws-etd.degree.department: David R. Cheriton School of Computer Science
uws-etd.degree.discipline: Computer Science
uws-etd.degree.grantor: University of Waterloo [en]
uws-etd.embargo.terms: 0
uws.contributor.advisor: Kari, Lila
uws.contributor.advisor: Lu, Yang Young
uws.contributor.affiliation1: Faculty of Mathematics
uws.peerReviewStatus: Unreviewed [en]
uws.published.city: Waterloo [en]
uws.published.country: Canada [en]
uws.published.province: Ontario [en]
uws.scholarLevel: Graduate [en]
uws.typeOfResource: Text [en]

Files

Original bundle

Name: Alipour_Fatemeh.pdf
Size: 17.04 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission