Title: Harmonizing the Scale: An End-to-End Self-Supervised Approach for Cross-Modal Data Retrieval in Histopathology Archives
Author: Maleki, Danial
Type: Doctoral Thesis
Date issued: 2023-09-13
Date accessioned: 2023-09-18
Date available: 2023-09-18
URI: http://hdl.handle.net/10012/19872
Language: en
Keywords: Machine Learning; Self-Supervised Learning; Cross-Modality Retrieval; Digital Pathology

Abstract:

In recent years, the exponential growth of data across various domains has necessitated the development of advanced techniques to process and analyze multi-modal big data. This is particularly relevant in the medical domain, where data comes in diverse formats such as images, reports, and molecular data. Consequently, bidirectional cross-modal data retrieval has become crucial for numerous research disciplines and domains. Cross-modal retrieval seeks to identify a shared latent space in which different modalities, such as image-text pairs, are closely related. Obtaining high-quality vision and text embeddings is vital to this objective. While training language models is feasible thanks to the availability of public data and the absence of labelling requirements, training vision models to generate effective embeddings is challenging when relying on supervised models, because labelled data is scarce.

To address this challenge, an end-to-end approach to learning vision embeddings in a self-supervised manner, coined H-DINO+LILE, is introduced through a modification of the DINO model. The proposed modification replaces DINO's existing local and global patching scheme with a new harmonizing patching approach, termed H-DINO, in which the magnitude of the various augmentations is kept consistent. This method captures the contextual information of images more consistently, thereby improving feature representation and retrieval accuracy.

Furthermore, a novel architecture is proposed that integrates the self-supervised learning and cross-modal retrieval modules in a back-to-back configuration, using self-attention and cross-attention modules to improve the representation of both the joint cross-modal space and the individual modalities. This architecture is trained end-to-end with a new loss term that facilitates image and text representation in the joint latent space.

The efficacy of the proposed framework is validated on various private and public datasets across diverse tasks, including patch-based (sub-image) and WSI-based (whole slide image) retrieval as well as text retrieval. This thesis demonstrates that the proposed framework substantially improves cross-modal retrieval in the medical domain. Moreover, its applicability extends beyond the medical field to other domains that require cross-modal retrieval and rely on patching gigapixel images in their methodologies.
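
To make the harmonizing-patching idea concrete, the following is a minimal PyTorch/torchvision sketch of a multi-crop pipeline in which global and local views share the same photometric augmentation magnitude and differ mainly in spatial extent. The crop sizes, scale ranges, and jitter strength are illustrative assumptions; the thesis's exact H-DINO parameters are not reproduced in this abstract.

```python
# Sketch of a harmonized multi-crop pipeline (illustrative, not the thesis's
# exact H-DINO configuration; crop sizes and scale ranges are assumptions).
from torchvision import transforms

def make_crop(size, scale, jitter=0.4):
    """One crop branch. Every branch shares the same photometric magnitude,
    so only the spatial extent differs between 'global' and 'local' views."""
    return transforms.Compose([
        transforms.RandomResizedCrop(size, scale=scale),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.ColorJitter(jitter, jitter, jitter, jitter / 4),
        transforms.ToTensor(),
    ])

class HarmonizedMultiCrop:
    """Produce 2 global + N local views with matched augmentation magnitude."""
    def __init__(self, n_local=6):
        self.global_crop = make_crop(224, scale=(0.4, 1.0))
        # Narrower scale gap than DINO's default local range of (0.05, 0.4),
        # keeping the views at a more consistent effective magnification.
        self.local_crop = make_crop(96, scale=(0.2, 0.4))
        self.n_local = n_local

    def __call__(self, image):
        views = [self.global_crop(image), self.global_crop(image)]
        views += [self.local_crop(image) for _ in range(self.n_local)]
        return views
```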
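
The back-to-back combination of self-attention and cross-attention mentioned above can be sketched as follows: the block refines text tokens with self-attention and then conditions them on image patch tokens with cross-attention. The dimensions, module layout, and single-direction design are assumptions made for illustration, not the thesis's exact architecture.

```python
# Sketch of a self-attention + cross-attention block (illustrative layout;
# the abstract states only that the two module types are combined).
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, txt_tokens, img_tokens):
        # Self-attention refines the text modality's own representation ...
        t = self.norm1(txt_tokens)
        txt_tokens = txt_tokens + self.self_attn(t, t, t)[0]
        # ... cross-attention then conditions text tokens on image patches.
        t = self.norm2(txt_tokens)
        txt_tokens = txt_tokens + self.cross_attn(t, img_tokens, img_tokens)[0]
        return txt_tokens
```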
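
Finally, a joint image-text latent space of the kind described is typically learned with a symmetric contrastive objective. The sketch below uses the standard InfoNCE formulation popularized by CLIP as a stand-in, since the abstract does not reproduce the thesis's new loss term.

```python
# Sketch of symmetric contrastive alignment in a shared latent space.
# This is the generic CLIP-style InfoNCE objective, NOT the thesis's loss.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (batch, dim) embeddings of paired images and reports."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # pairwise similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Pull matched image-text pairs together and push mismatched pairs apart,
    # symmetrically, so retrieval works in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```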