Lin, Sheng-Chieh
2024-12-17; 2024-12-02
https://hdl.handle.net/10012/21253

First-stage retrieval plays a pivotal role in modern search engines, aiming to gather candidates from a large corpus for various downstream tasks, such as question answering, fact-checking, and retrieval-augmented generation. Historically, heuristic lexical retrieval models, such as BM25, have been widely employed as first-stage retrievers. Recently, driven by advancements in pre-trained transformers and approximate nearest neighbor search, transformer-based dense retrieval models have emerged as a strong alternative to BM25. These models address the issue of term mismatch in lexical representations by enabling semantic matching. Unlike BM25, transformer-based dense retrieval models require substantial resources for training, such as human relevance labels and GPUs. Moreover, even with sufficient training data, these models often lag behind BM25 in robustness across different retrieval domains, tasks, and languages, limiting their broader applicability. As an alternative, dense–sparse fusion and multi-vector bi-encoder retrieval models have been shown to capture both semantic and lexical features, demonstrating superior retrieval robustness. However, these designs increase query latency and require a more complex indexing and retrieval pipeline, making them less suitable for first-stage retrieval systems.

To address these challenges and enable a robust yet efficient first-stage retrieval system, this thesis contributes to the field of information retrieval in two key directions. First, we propose efficient training methods to enhance the effectiveness of transformer-based dense retrieval models. Second, we introduce a dense representation framework that simplifies the integration of dense and sparse (semantic and lexical matching) indexing and retrieval.

To begin, we propose a method to enhance the efficiency of a widely adopted training technique for dense retrieval models: knowledge distillation. Rather than distilling knowledge from a more powerful cross-encoder ranker, we leverage a multi-vector bi-encoder model (i.e., ColBERT) to train a dense retrieval model, which we call TCT-ColBERT. Specifically, we transfer knowledge from the bi-encoder teacher to the student by distilling ColBERT’s fine-grained scoring function into a simpler dot-product operation. The bi-encoder teacher–student setup offers a significant advantage: it allows for the efficient incorporation of in-batch negatives during knowledge distillation, facilitating richer interactions between the teacher and student models. Our results demonstrate that this approach improves both the training efficiency and the overall effectiveness of dense retrieval models compared to prior methods.

Despite the success of dense retrieval models, we observe that they still fall short of multi-vector dense retrieval models and even BM25 in terms of robustness across various retrieval domains and tasks. Recent studies have explored scaling dense retrieval models by increasing both training data and model size. However, this approach necessitates collecting billions of training samples from the web, significantly driving up training costs. To enhance the efficiency of training robust dense retrieval models, we systematically investigate dense retrieval training through the lens of data augmentation and identify a critical factor: diverse query and label augmentation (a minimal sketch of this idea follows below).
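To make "diverse query and label augmentation" concrete, here is a minimal, hypothetical sketch of how such training data could be assembled: pseudo-queries for each passage are drawn from more than one source (sentence cropping plus a neural query generator), and relevance labels come from more than one teacher scorer. The names `augment_training_data`, `generate_queries`, and `teachers` are illustrative assumptions, not the thesis's actual implementation, and brute-force scoring stands in for retrieval with a real index.

```python
# Hypothetical sketch of diverse query and label augmentation for dense-retrieval
# training data; names and structure are illustrative, not the thesis implementation.
import random
from typing import Callable, Dict, List, Tuple

def crop_sentence(passage: str, min_len: int = 5, max_len: int = 15) -> str:
    """Query-augmentation source #1: a randomly cropped span of the passage as a pseudo-query."""
    words = passage.split()
    if not words:
        return passage
    length = max(1, min(len(words), random.randint(min_len, max_len)))
    start = random.randint(0, len(words) - length)
    return " ".join(words[start:start + length])

def augment_training_data(
    corpus: Dict[str, str],                        # passage_id -> passage text
    generate_queries: Callable[[str], List[str]],  # query-augmentation source #2, e.g., a doc2query-style generator (assumed)
    teachers: List[Callable[[str, str], float]],   # diverse relevance scorers, e.g., sparse and dense models (assumed)
    top_k: int = 10,
) -> List[Tuple[str, str, List[str]]]:
    """Build (query, positive_id, negative_ids) triples with diverse queries and labels."""
    examples = []
    passage_ids = list(corpus)
    for text in corpus.values():
        # Query augmentation: pseudo-queries from cropping plus a neural generator.
        queries = [crop_sentence(text)] + generate_queries(text)
        for query in queries:
            # Label augmentation: each teacher relabels candidates for this query, so the
            # supervision reflects several complementary views of relevance. (Brute-force
            # scoring here stands in for index-based retrieval over a real corpus.)
            for score in teachers:
                ranked = sorted(passage_ids, key=lambda p: score(query, corpus[p]), reverse=True)
                examples.append((query, ranked[0], ranked[1:top_k]))
    return examples
```

In the thesis, the corpus in question is the 8.8 million MS MARCO passages mentioned below; the specific generators and teachers in this sketch are assumptions.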
Consequently, we are the first to empirically demonstrate that a BERT-base-sized dense retrieval model (110M parameters), named DRAGON, can rival the state-of-the-art dense retrieval model GTR-XXL, which has 4.8 billion parameters (40 times larger), in both supervised and zero-shot evaluations. Crucially, instead of relying on billion-scale web-crawled data, our data augmentation approach uses only the 8.8 million passages from the MS MARCO corpus.

Subsequently, we explore how to deploy a more effective hybrid retrieval model efficiently. To achieve this, we propose a dense representation framework for semantic and lexical matching, incorporating a novel textual representation and scoring function. This unified framework simplifies fusion retrieval over any off-the-shelf dense (semantic) and sparse (lexical) retrievers with a single index. In addition, fusion retrieval can be accelerated with GPUs and efficient approximate nearest neighbor search through existing libraries. Moreover, this framework enables the development of a newly designed bi-encoder hybrid retrieval model, (DeLADE+[CLS])DHR, which achieves competitive retrieval effectiveness while significantly reducing runtime costs compared to existing bi-encoder multi-vector retrievers.

Finally, we introduce a new dense retrieval model, Aggretriever. In contrast to existing dense retrieval models, which derive semantic textual representations solely from [CLS] tokens or average pooling, Aggretriever fully leverages the capabilities of pre-trained transformers by integrating semantic features from [CLS] with lexical features from masked language modeling. Our experiments highlight the advantages of Aggretriever over existing dense retrieval models in terms of training efficiency and robustness. Specifically, Aggretriever achieves competitive retrieval effectiveness in both supervised and zero-shot evaluations while requiring only a single V100 GPU (32GB) and avoiding complex, costly training strategies such as knowledge distillation and hard negative mining. Additionally, we extend Aggretriever to a multilingual dense retrieval model, mAggretriever, demonstrating its superior zero-shot transferability across languages. On multilingual benchmarks, mAggretriever, fine-tuned solely on English training data, outperforms existing multilingual dense retrieval models that rely on computationally expensive pre-training with large-scale multilingual datasets crawled from the web.

In conclusion, we summarize our findings and explore potential future research directions. For instance, we highlight the promise of leveraging large language models (LLMs) to develop more robust retrieval systems. These systems could enable advanced capabilities such as search planning and multimodal retrieval, paving the way for more versatile and effective information retrieval frameworks.

en
information retrieval; natural language processing; artificial intelligence
Building a Robust Retrieval System with Dense Retrieval Models
Doctoral Thesis