Democratizing and Modernizing Information Access: From Open Rerankers to Scalable RAG Evaluation
Date
Authors
Advisor
Lin, Jimmy
Publisher
University of Waterloo
Abstract
Modern information access increasingly relies on complex pipelines built around large language models (LLMs), from sophisticated multi-stage retrieval pipelines to end-to-end retrieval-augmented generation (RAG) systems, fundamentally changing how users interact with information. While these advances enhance the user experience, they also introduce significant challenges. The research community's growing reliance on proprietary, black-box models for key tasks like document reranking creates barriers to innovation and reproducibility (the Component Challenge). Progress is further hampered by the lack of a shared, standardized ecosystem for executing and measuring information access systems (the Benchmarking Challenge). Finally, the generative nature of RAG systems makes them fundamentally harder to evaluate than traditional systems that return document lists; new methodologies are required to assess factual accuracy and completeness in a reliable, scalable manner (the Evaluation Challenge). We argue that progress depends on the synergistic development of open, high-effectiveness system components and the reliable, scalable evaluation frameworks necessary to assess them.
This thesis addresses these challenges through a narrative arc that begins by pushing existing paradigms to their limits. Given that frontier models dominate today's landscape, we find a pressing need for new open-source solutions. We begin by analyzing the dominant supervised ranking paradigm, developing multi-stage pipelines that demonstrate high effectiveness but also highlight inherent complexity and cost. We then conduct a systematic exploration of model backbones, loss functions, and negative mining strategies to squeeze further effectiveness gains from supervised pointwise cross-encoders. Next, we present a large-scale empirical study of the newly evolving generative retrieval paradigm, which reveals its scalability limitations on large, real-world collections. This portion of the thesis culminates in its final contribution to the Component Challenge: RankZephyr, an open-source 7B-parameter listwise reranker. By leveraging a carefully designed instruction distillation curriculum, RankZephyr matches and often surpasses the effectiveness of much larger proprietary models such as GPT-4, providing the community with a powerful, transparent, and accessible zero-shot reranking module and breaking the dependence on black-box systems for this critical task. The methods described here have seen broad community adoption, and our models and evaluation frameworks continue to support ongoing research across open-source IR and RAG development.
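To make the listwise reranking setup concrete, the following is a minimal sketch of the sliding-window recipe that RankZephyr-style rerankers follow: number a window of candidate passages, ask the model for a permutation, and slide the window from the bottom of the ranked list to the top. The prompt wording, the `llm_rank` callable, and the window/stride defaults are illustrative assumptions, not the exact RankZephyr interface.

```python
# Minimal sketch of sliding-window listwise reranking (RankZephyr/RankGPT style).
# The prompt format and the `llm_rank` callable are assumptions for illustration.
import re
from typing import Callable, List


def build_prompt(query: str, passages: List[str]) -> str:
    """Number the candidate passages and ask the model for an ordering."""
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Rank the following passages by relevance to the query.\n"
        f"Query: {query}\n{numbered}\n"
        "Answer with identifiers only, e.g. [2] > [1] > [3]."
    )


def parse_permutation(output: str, n: int) -> List[int]:
    """Parse '[2] > [1] > ...' into 0-based indices, repairing omissions/duplicates."""
    order = []
    for token in re.findall(r"\[(\d+)\]", output):
        idx = int(token) - 1
        if 0 <= idx < n and idx not in order:
            order.append(idx)
    order.extend(i for i in range(n) if i not in order)  # keep anything the model dropped
    return order


def sliding_window_rerank(
    query: str,
    passages: List[str],
    llm_rank: Callable[[str], str],  # maps a prompt to the model's ranking string
    window: int = 20,
    stride: int = 10,
) -> List[str]:
    """Rerank bottom-up with overlapping windows so strong candidates bubble to the top."""
    passages = list(passages)
    start = max(0, len(passages) - window)
    while True:
        chunk = passages[start : start + window]
        order = parse_permutation(llm_rank(build_prompt(query, chunk)), len(chunk))
        passages[start : start + window] = [chunk[i] for i in order]
        if start == 0:
            return passages
        start = max(0, start - stride)
```

In practice, `llm_rank` would wrap a call to the reranking model; keeping it as a parameter makes the windowing and permutation-parsing logic independent of any particular serving stack.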
With powerful open components in hand, the focus shifts to benchmarking. To address the Benchmarking Challenge, this work introduces Ragnarök, a reusable, end-to-end RAG framework designed to standardize how retrieval-augmented generation systems are constructed and assessed. Serving as the backbone of the TREC 2024 Retrieval-Augmented Generation Track, Ragnarök provides the research community with a shared experimental platform, critical data resources, and reproducible, effective baselines. By encapsulating the full RAG pipeline, from retrieval and grounding to generation and scoring, within a single, transparent framework, Ragnarök and the TREC 2024 Retrieval-Augmented Generation Track enable reproducible experimentation at scale. This not only ensures fair comparisons across diverse approaches but also establishes a foundation for cumulative progress in open-domain information access research, where ad hoc, non-replicable setups have previously impeded reliable evaluation.
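To make the pipeline stages concrete, here is a minimal sketch of the retrieve, ground, and generate loop that such a framework standardizes. The `Segment` dataclass, the prompt wording, and the `retrieve`/`generate` stubs are illustrative assumptions for this sketch, not the actual Ragnarök API; scoring of the grounded answers happens downstream in the evaluation stage.

```python
# Minimal sketch of a retrieve-ground-generate loop behind a single interface.
# The dataclass, prompt format, and stub callables are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Segment:
    docid: str
    text: str


def build_grounded_prompt(query: str, segments: List[Segment]) -> str:
    """Assemble numbered evidence so the generator can cite segments as [i]."""
    evidence = "\n".join(f"[{i}] ({s.docid}) {s.text}" for i, s in enumerate(segments))
    return (
        "Answer the question using only the evidence below, citing segments as [i].\n"
        f"Evidence:\n{evidence}\n\nQuestion: {query}\nAnswer:"
    )


def run_rag(
    query: str,
    retrieve: Callable[[str, int], List[Segment]],  # first-stage retrieval + reranking
    generate: Callable[[str], str],                  # grounded LLM generation
    k: int = 20,
) -> Dict[str, object]:
    """One end-to-end request: retrieval, grounding, and generation behind one call."""
    segments = retrieve(query, k)
    answer = generate(build_grounded_prompt(query, segments))
    return {"query": query, "segments": [s.docid for s in segments], "answer": answer}
```

Keeping retrieval and generation behind narrow callables is what makes runs comparable: different systems can swap in their own components while producing output in the same shape for downstream scoring.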
Building on this infrastructure, the thesis then directly tackles the Evaluation Challenge by introducing the AutoNuggetizer framework, which refactors the classic and well-studied nugget-based evaluation methodology for the modern era of LLMs. By automating the assessment of information nugget recall in RAG responses and validating the approach at scale in the TREC 2024 Retrieval-Augmented Generation Track, this work provides a reliable and scalable methodology for measuring the quality of generative information access systems.
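To illustrate the scoring side of nugget-based evaluation, below is a small sketch of nugget-recall scoring. In this setting, nugget creation and nugget assignment (deciding whether a response supports, partially supports, or does not support each nugget) would be automated with LLMs; the specific weighting shown here, scoring over vital nuggets with partial support counted as 0.5, is one common convention and is an assumption in this sketch rather than the track's exact formula.

```python
# Sketch of nugget-recall scoring in the spirit of nugget-based RAG evaluation.
# The vital-only scoring and 0.5 partial credit are assumptions for illustration.
from typing import Dict, List


def vital_nugget_score(
    nuggets: List[Dict[str, str]],   # e.g. {"text": ..., "importance": "vital" | "okay"}
    assignments: Dict[str, str],     # nugget text -> "support" | "partial_support" | "not_support"
    partial_credit: float = 0.5,
) -> float:
    """Fraction of vital nuggets covered by a response, with partial credit."""
    credit = {"support": 1.0, "partial_support": partial_credit, "not_support": 0.0}
    vital = [n for n in nuggets if n["importance"] == "vital"]
    if not vital:
        return 0.0
    total = sum(credit.get(assignments.get(n["text"], "not_support"), 0.0) for n in vital)
    return total / len(vital)


# Example: two vital nuggets, one fully and one partially supported -> 0.75.
nuggets = [
    {"text": "Ragnarök backed the TREC 2024 RAG Track", "importance": "vital"},
    {"text": "RankZephyr is a 7B listwise reranker", "importance": "vital"},
    {"text": "Both artifacts are open source", "importance": "okay"},
]
assignments = {
    "Ragnarök backed the TREC 2024 RAG Track": "support",
    "RankZephyr is a 7B listwise reranker": "partial_support",
}
print(vital_nugget_score(nuggets, assignments))  # 0.75
```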
In summary, this thesis contributes to the field of information access by exploring the limits of existing retrieval and ranking paradigms, developing powerful open-source components for modern information access systems, and creating the frameworks and methodologies required to benchmark and evaluate them. The contributions include a comprehensive analysis of the supervised ranking and generative retrieval paradigms, an open-source state-of-the-art listwise reranker (RankZephyr), a standardized framework for RAG benchmarking (Ragnarök), and a scalable methodology for evaluating generative systems (AutoNuggetizer). Together, these contributions address the three core challenges identified at the outset, providing the community with both the tools to build effective systems and the methodologies to assess them rigorously. The widespread adoption of these artifacts by researchers and practitioners already underscores their tangible impact and utility in driving the field forward.
Looking ahead, on the reranking front, we would like to build faster, more efficient rerankers that can reason over text and generalize across domains. On the benchmarking front, we will expand the task suite to capture "deep research" information needs that demand multi-hop reasoning and query decomposition. On the evaluation front, we hope to extend the AutoNuggetizer methodology beyond web retrieval to other settings such as biomedical text and conversational question answering.