Author: Thakur, Nandan
Dates: 2026-04-27; 2026-03-23
URI: https://hdl.handle.net/10012/23053

Modern-day applications increasingly rely on retrieval models and large language models (LLMs) to solve knowledge-intensive natural language processing (NLP) tasks such as information retrieval (IR), question answering, and fact-checking. However, recent progress has been shaped by assumptions that are increasingly misaligned with real-world deployment: that scaling training data and model size reliably improves out-of-domain robustness, and that leaderboard performance on static, English-centric benchmarks is a meaningful proxy for generalization. LLMs remain limited on such tasks, producing "hallucinations" when queries fall beyond their parametric knowledge. Retrieval-augmented generation (RAG) enhances LLMs by retrieving relevant documents from an external knowledge base via a retrieval system. Similarly, with the advent of pretrained transformers, retrieval systems have demonstrated strong downstream accuracy in in-domain settings. However, their robustness in zero-shot scenarios remains limited, suggesting that these assumptions do not consistently hold in practice and that current benchmarks and training datasets fail to capture real-world distribution shifts.

Despite recent advances, progress on retrieval and RAG systems remains hindered by three fundamental challenges. First, existing retrieval benchmarks are largely static, English-focused, and restricted to homogeneous domains, limiting their ability to measure generalization under domain shift, multilinguality, and evolving corpora. Second, the training data used to fine-tune retrievers is often noisy, imbalanced, sparsely annotated, and scarce in non-English languages, so increasing scale through mixtures of supervised and synthetic data does not reliably translate into improved robustness without principled curation. Third, evaluating RAG systems requires judging not only retrieval relevance but also the factuality, grounding, and completeness of long-form answers, for which traditional retrieval metrics are insufficient and large-scale human evaluation is impractical. Together, these limitations constrain reliable progress and obscure real-world performance, motivating the need for realistic benchmarks, high-quality training data, and scalable, trustworthy evaluation methodologies.

Given the breadth of the challenges outlined, this thesis addresses all three aspects critical to studying and improving robustness in retrieval and RAG systems. The first part focuses on retrieval benchmarks. It revisits argument retrieval within the BEIR benchmark, analyzing why neural retrievers underperform BM25, and improves benchmark robustness through corpus denoising and post-hoc relevance assessment. Subsequently, it introduces MIRACL, a large-scale multilingual ad hoc retrieval benchmark spanning 18 languages with 78k queries and 726k human judgments, improving systematic evaluation of retrieval systems through better coverage across high- and low-resource languages. To move beyond static, fixed corpora, the thesis further proposes FreshStack, an automatic framework for constructing realistic benchmarks on recent, niche technical domains while avoiding data contamination, combining automatic corpus collection, automatic nugget generation, and nugget-level support assessment using LLMs as judges and hybrid retrieval systems.
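To make the hybrid retrieval component of such pipelines concrete, the following is a minimal Python sketch of reciprocal rank fusion (RRF), one standard way to combine a lexical ranking (e.g., BM25) with a dense-retriever ranking. The function name, the toy document IDs, and the choice of RRF itself are illustrative assumptions; this is not presented as the exact fusion procedure used in FreshStack.

```python
# Illustrative sketch (not FreshStack's exact method): reciprocal rank fusion (RRF)
# combines several ranked lists without requiring comparable retriever scores.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document receives sum(1 / (k + rank)) over the lists in which it appears,
    so documents ranked highly by multiple retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy rankings from a lexical retriever and a dense retriever.
bm25_ranking = ["d3", "d1", "d7", "d2"]
dense_ranking = ["d1", "d9", "d3", "d4"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
```

Because RRF relies only on rank positions rather than raw scores, it is a common default for mixing lexical and neural retrievers whose score scales are not directly comparable.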
The second part examines the role of training data quality in fine-tuning retrieval systems. To alleviate data scarcity across non-English languages, the thesis presents SWIM-IR, a large-scale synthetic multilingual training dataset covering 33 languages and 28 million query–document pairs, and shows that multilingual dense retrievers fine-tuned only on synthetic data can match or exceed the downstream accuracy of supervised models. It further studies the quality of supervised retrieval data, demonstrating that dataset pruning and false-negative relabeling with cascading LLM judges in RLHN (ReLabeling Hard Negatives) yield consistent improvements in out-of-domain robustness across heterogeneous domains for dense retrievers and rerankers.

The final part of the thesis addresses the evaluation of retrieval-augmented generation. It introduces Ragnarök, a standardized benchmarking framework implemented through the TREC 2024 RAG track, providing the MS MARCO V2.1 collection, topics, a unified input–output format, and strong retrieval and reranking baselines. It further investigates support assessment in RAG evaluation at scale, showing substantial agreement between human judgments and LLM-based judges, thereby validating their use as practical evaluators in the TREC RAG track. In addition, the thesis presents NoMIRACL, a human-annotated dataset for relevance assessment in multilingual RAG, revealing persistent trade-offs between abstention and correctness in multilingual LLMs. Finally, it introduces MIRAGE-Bench, a synthetic arena-style multilingual RAG benchmark that trains a surrogate judge from heuristic signals and LLM preferences, enabling scalable, cost-effective evaluation with high agreement with LLM-based rankings.

Overall, this thesis advances the methodological foundations of robust retrieval and retrieval-augmented generation by demonstrating that reliable progress is not limited to model architectures and scaling, but also depends on realistic benchmarks, high-quality training data, and principled, scalable evaluation frameworks. By grounding evaluation in heterogeneous domains and languages, the thesis provides a more reliable basis for measuring progress and deploying robust retrieval and RAG systems in real-world settings.

Language: en
Keywords: information retrieval; natural language processing; retrieval-augmented generation; large language models; benchmarking; datasets; evaluation; multilingual; zero-shot generalization; dense retrieval; data quality; llm-as-judge; synthetic data generation
Title: Benchmarks, Data, and Evaluation for Robust Retrieval and Retrieval-Augmented Generation on Heterogeneous Domains and Languages
Type: Doctoral Thesis