Pretrained Transformers for Efficient and Robust Information Retrieval
Date
2024-08-22
Advisor
Lin, Jimmy
Publisher
University of Waterloo
Abstract
Pretrained transformers have significantly advanced the field of information retrieval (IR) since the introduction of BERT. Through unsupervised pretraining followed by task-aware finetuning, the encoder-only BERT effectively captures the complex semantics of text and overcomes the limitations of traditional bag-of-words systems such as BM25, addressing the lexical mismatch between queries and documents. Successor models based on decoder-only pretrained transformers such as GPT and LLaMA have further improved ranking effectiveness by scaling up computation. Undoubtedly, these pretrained transformers have had a profound impact on search and ranking, completely changing the landscape of modern IR systems.
Consequently, pretrained transformers are now applied at various stages of a modern IR system, which typically comprises a first-stage retriever, a second-stage reranker, and an optional large language model (LLM). For instance, the first-stage retriever can be a bi-encoder that encodes queries and documents separately; the encoded document collection is indexed into a vector database for nearest neighbor search. Given a query, the first-stage retriever selects the top-k most similar documents from the vector database for the second-stage reranker to review. The second-stage reranker is usually a cross-encoder that evaluates the query and documents jointly, further refining the search results. The final, optional step is to use an LLM to summarize the retrieved information before returning it to users; this step has become increasingly popular in modern search engines.
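To make this pipeline concrete, the following minimal sketch shows a two-stage system built with the sentence-transformers library: a bi-encoder retrieves candidates by vector similarity, and a cross-encoder rescores each (query, document) pair jointly. The checkpoints (all-MiniLM-L6-v2 and cross-encoder/ms-marco-MiniLM-L-6-v2) and the tiny in-memory corpus are illustrative assumptions; the sketch shows the general architecture, not the specific systems studied in this thesis.

```python
# A minimal two-stage pipeline sketch: bi-encoder retrieval, then cross-encoder reranking.
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

docs = [
    "BM25 is a bag-of-words ranking function based on term statistics.",
    "BERT is an encoder-only pretrained transformer.",
    "Product quantization compresses dense vectors for nearest neighbor search.",
]

# First stage: the bi-encoder encodes queries and documents independently.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")                  # example checkpoint
doc_vectors = bi_encoder.encode(docs, normalize_embeddings=True)      # stands in for a vector index

query = "What is a bag-of-words retrieval model?"
q_vector = bi_encoder.encode([query], normalize_embeddings=True)[0]

k = 2
similarities = doc_vectors @ q_vector            # cosine similarity (vectors are normalized)
top_k = np.argsort(-similarities)[:k]            # indices of the k most similar documents

# Second stage: the cross-encoder reads the query and each candidate jointly.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")       # example checkpoint
pairs = [(query, docs[i]) for i in top_k]
rerank_scores = reranker.predict(pairs)

for i, score in sorted(zip(top_k, rerank_scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {docs[i]}")
```

In a real deployment the document vectors would live in a vector database or approximate nearest neighbor index rather than a NumPy array, and only the top-k candidates would ever reach the more expensive cross-encoder.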
However, pretrained transformers are not without flaws. Compared to traditional IR methods, they frequently run into efficiency and robustness challenges in practice. For example, first-stage retrievers and their vector indexes often suffer from redundancy, leading to increased search latency and excessive storage requirements. Second-stage rerankers based on cross-encoder pretrained transformers also incur high computational costs when reranking a large set of retrieved documents.
In addition, the representations produced by pretrained transformers are usually lossy compressions of the inputs and are tied to particular data distributions, making them susceptible to out-of-domain queries and to subtle details such as dates and locations. These issues can further impair the downstream LLM, hindering the efficient extraction of useful information from the retrieved documents and promoting hallucinations when processing biased content retrieved by adversarial queries.
Given the breadth of the challenges outlined, this thesis will delineate a specific scope of study for efficiency and robustness. Regarding efficiency, our focus will be on minimizing search latency and storage costs during first-stage retrieval and second-stage reranking, along with reducing the generation latency of the downstream LLM. Regarding robustness, we will concentrate on the reliability of both retrievers and rerankers in handling out-of-domain queries, as well as on ensuring factuality and accurate attribution in the output of the retrieval-augmented LLM.
In this context, we outline our efforts to address the efficiency and robustness challenges in the major components of a modern IR system based on pretrained transformers, ranging from first-stage retrieval (including dense, sparse, hybrid, and multi-vector retrievers) to second-stage cross-encoder reranking, as well as tackling hallucination and attribution issues in downstream LLMs from a retrieval standpoint.
For first-stage dense retrieval models, we explore index compression to reduce search latency and storage space through dimensionality reduction and product quantization. We also detail our work on improving the robustness of these models through techniques such as model ensembling and integration with sparse retrieval methods. Extending these ideas to multi-vector retrieval systems, we employ dynamic lexical routing and token pruning to jointly optimize for efficiency and robustness during finetuning on ranking tasks. Additionally, we examine the integration of sparse retrieval in a multi-vector system and propose sparsified late interactions for efficient retrieval, ensuring compatibility with traditional inverted indexes and reducing latency.
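To illustrate the kind of index compression referred to above, the sketch below combines a PCA projection (dimensionality reduction) with product quantization using the Faiss library. The dimensions, code sizes, and random vectors are illustrative assumptions rather than the configurations evaluated in the thesis.

```python
# A minimal index-compression sketch: PCA (128 -> 64 dims) followed by product quantization.
import numpy as np
import faiss

d = 128                  # original embedding dimension
n = 10_000               # number of document vectors
rng = np.random.default_rng(0)
xb = rng.standard_normal((n, d)).astype("float32")   # stand-in for encoded documents
xq = rng.standard_normal((5, d)).astype("float32")   # stand-in for encoded queries

# Dimensionality reduction: project 128-dim vectors down to 64 dims with PCA.
pca = faiss.PCAMatrix(d, 64)

# Product quantization: split each 64-dim vector into 8 sub-vectors of 8 bits each,
# so a document is stored in 8 bytes instead of 64 * 4 = 256 bytes.
pq = faiss.IndexPQ(64, 8, 8)

index = faiss.IndexPreTransform(pca, pq)
index.train(xb)          # learn the PCA projection and the PQ codebooks
index.add(xb)            # encode and store the compressed document codes

distances, ids = index.search(xq, 10)   # approximate top-10 neighbors per query
print(ids[0])
```

The trade-off is the usual one: smaller codes and faster scans in exchange for approximate rather than exact nearest neighbor search.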
In the later sections, we continue to address the efficiency and robustness challenges in second-stage rerankers based on cross-encoders. We introduce a straightforward strategy for cross-encoder rerankers that adds late interaction at the last layer to handle out-of-domain, long-sequence data with minimal latency overhead. We also assess the impact of using LLMs for query expansion with high-precision cross-encoder rerankers on conventional ranking tasks, finding that traditional query expansion methods can undermine the robustness of effective rerankers. After integrating first-stage retrieval with second-stage reranking, we propose a candidate-set pruning method with high-confidence error control to accelerate the reranking process reliably, allowing users to specify their precision goals while maintaining efficiency with theoretical guarantees.
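For readers unfamiliar with late interaction, the sketch below shows the MaxSim scoring it is built on: each query token is matched to its most similar document token, and the per-token maxima are summed. The random NumPy embeddings stand in for contextualized token vectors; this illustrates only the scoring mechanism, not the last-layer reranker architecture proposed in this work.

```python
# A minimal late-interaction (MaxSim) scoring sketch over token embeddings.
import numpy as np

rng = np.random.default_rng(0)
dim = 32
Q = rng.standard_normal((6, dim))    # 6 query token embeddings (stand-ins)
D = rng.standard_normal((80, dim))   # 80 document token embeddings, e.g. a long passage

# Normalize so dot products are cosine similarities.
Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
D = D / np.linalg.norm(D, axis=1, keepdims=True)

sim = Q @ D.T                        # (6, 80) token-level similarity matrix
score = sim.max(axis=1).sum()        # MaxSim: sum of each query token's best match
print(f"late-interaction score: {score:.3f}")
```

Because the interaction is just a matrix product followed by a max and a sum, it adds very little compute on top of the token embeddings, which is why it can be attached to a cross-encoder's last layer with little latency overhead.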
Finally, we use these optimized two-stage retrieval systems to address hallucination and attribution problems in downstream retrieval-augmented LLMs with improved efficiency for generation. We introduce a hybrid generation model that incorporates segments retrieved from the corpus directly into LLM outputs to reduce hallucinations and improve attribution in knowledge-intensive tasks. Our final system leverages our efficient and robust solutions in neural retrieval, delivering precise responses swiftly and accelerating content generation by processing multiple tokens per time step, with a post-hoc revision mechanism to balance robustness and efficiency.
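As a toy illustration of the general idea of copying retrieved segments into the output (not the actual hybrid generation model), the snippet below appends a multi-token span from a retrieved passage in a single step and records its source; the copy_span helper and its simple matching heuristic are hypothetical and exist only for exposition.

```python
# Toy sketch: copy a multi-token span from a retrieved passage and keep its provenance.
retrieved = {
    "doc42": "Product quantization compresses dense vectors into compact codes "
             "for approximate nearest neighbor search.",
}

def copy_span(prefix: str, passages: dict, span_len: int = 6):
    """Find the last generated token inside a retrieved passage and return the
    span_len tokens that follow it, together with the source document id."""
    last = prefix.split()[-1].lower().strip(".,")
    for doc_id, text in passages.items():
        tokens = text.split()
        for i, tok in enumerate(tokens):
            if tok.lower().strip(".,") == last:
                span = tokens[i + 1 : i + 1 + span_len]
                if span:
                    return " ".join(span), doc_id
    return None, None

prefix = "Search indexes stay small because product quantization"
span, source = copy_span(prefix, retrieved)
if span:
    print(f"{prefix} {span}")           # several tokens are emitted in one step
    print(f"[copied from {source}]")    # the copied span carries a direct attribution
```

The copied span is verbatim corpus text, so it is attributable by construction, and emitting it as a unit is what allows multiple tokens to be produced per generation step.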
Overall, this thesis aims to improve the efficiency and robustness of key components in a contemporary IR system and their integration with downstream LLMs.
Keywords
information retrieval, natural language processing, pretrained transformers