Computer Science
Permanent URI for this collection: https://uwspace.uwaterloo.ca/handle/10012/9930
This is the collection for the University of Waterloo's Cheriton School of Computer Science.
Research outputs are organized by type (e.g., Master's Thesis, Article, Conference Paper).
Waterloo faculty, students, and staff can contact us or visit the UWSpace guide to learn more about depositing their research.
Recent Submissions
Categories as a Foundation for both Learning and Reasoning (University of Waterloo, 2026-01-21). Shaw, Nolan.

This thesis explores two distinct research topics, both applying category theory to machine learning. The first topic discusses Vector Symbolic Architectures (VSAs). I present the first attempt at formalising VSAs with category theory. VSAs are built to perform symbolic reasoning in high-dimensional vector spaces. I present a brief literature survey demonstrating that the topic is currently unexplored. I discuss some desiderata for VSA models, then describe an initial formalisation that covers two of the three desiderata. My formalisation focuses on two of the three primary components of a VSA, binding and bundling, and presents a proof of why element-wise operations constitute the ideal means of performing them. The work extends beyond vectors to any co-presheaves with the desired properties; for example, GHRR representations are captured by this generalisation. The second line of work discusses, and expands upon, recent work by Milewski on the construction of "pre-lenses". This work is motivated by pre-established formalisations of supervised machine learning. From the perspective of category theory, pre-lenses are interesting because they unify the category Para, or Learn, with its dual co-Para, or co-Learn. From a computer science perspective, pre-lenses are interesting because they enable programmers to build neural networks with vanilla function composition, and they unify various network features by leveraging the fact that they are profunctors. I replicate Milewski's code, extend it to non-synthetic data (MNIST), implement re-parameterisations, and describe generative models as dual to discriminative models by way of pre-lenses. This work involved creating a simple dataloader to read in external files, randomising the order in which inputs are presented during learning, and fixing some bugs that did not manifest when training occurred on the very small dataset used by Milewski.
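As a concrete illustration of binding and bundling, the two VSA components this formalisation covers, here is a minimal MAP-style sketch in Python: bipolar hypervectors, element-wise product as binding, element-wise majority as bundling. The names, dimensionality, and thresholds are illustrative choices, not the thesis's categorical treatment, which is far more general (extending to co-presheaves such as GHRR).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # hypervector dimensionality (illustrative choice)

def random_hv():
    """A random bipolar hypervector in {-1, +1}^D."""
    return rng.choice([-1, 1], size=D)

def bind(x, y):
    """Binding as the element-wise (Hadamard) product; self-inverse for bipolar vectors."""
    return x * y

def bundle(*vs):
    """Bundling as element-wise majority (ties collapse to 0 and merely dilute similarity)."""
    return np.sign(np.sum(vs, axis=0))

def similarity(x, y):
    """Normalized dot product; near 0 for unrelated hypervectors."""
    return float(x @ y) / D

# encode a record of role-filler pairs: colour=red, shape=circle
colour, red, shape, circle = (random_hv() for _ in range(4))
record = bundle(bind(colour, red), bind(shape, circle))

# unbinding with the role recovers a noisy copy of the filler
assert similarity(bind(record, colour), red) > 0.3
assert abs(similarity(bind(record, colour), circle)) < 0.1
```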
Integrating Symbolic Reasoning into Large Language Models (University of Waterloo, 2026-01-20). Dhanraj, Varun.

Large language models (LLMs) face fundamental challenges in symbolic reasoning, struggling with tasks requiring precise rule-following, logical consistency, and manipulation of structured representations. This thesis introduces a comprehensive neurosymbolic framework that addresses these limitations by integrating Vector Symbolic Algebras (VSAs) directly into the computational flow of transformer-based language models. Our core method encodes LLM hidden states into compositional neurosymbolic vectors, enabling symbolic algorithms to operate within a high-dimensional vector space before decoding results back into the neural network's processing pipeline. We demonstrate that LLMs naturally develop internally separable representations for symbolic concepts, which our linear and transformer-based encoders can extract with high fidelity. On mathematical reasoning tasks, our approach achieves 88.6% lower cross-entropy loss and solves 15.4 times more problems correctly than chain-of-thought prompting and LoRA fine-tuning, while preserving performance on non-mathematical tasks through selective intervention. Beyond arithmetic, we extend this framework to three applications. First, we enable language-only models to perform visual question answering by encoding segmented images as queryable VSA representations, achieving 92% accuracy without requiring multimodal architectures. Second, we demonstrate environment navigation, where LLMs use spatial semantic pointers to interpret and act upon grid-based worlds according to natural language instructions. Third, we address the context length limitations of LLMs by compressing reasoning histories into VSA representations, maintaining performance on iterative problem-solving tasks while avoiding quadratic scaling costs. Our results establish VSA-based neurosymbolic integration as a practical approach for augmenting neural language models with symbolic reasoning capabilities, providing both theoretical insights into LLM representations and practical improvements across diverse reasoning tasks. This work contributes to the broader goal of creating AI systems that combine the flexibility of neural networks with the precision and interpretability of symbolic computation. Code and data are available at https://github.com/vdhanraj/Neurosymbolic-LLM.

Path Reduction and Coverage Complexity for Fuzzing (University of Waterloo, 2026-01-20). Wang, Zekun.

Coverage-guided fuzzing is one of the most effective approaches to automated software testing, yet its performance depends critically on the coverage metric that guides input generation. It is widely assumed that finer metrics (especially path coverage, which captures complete control-flow information) should lead to more effective fuzzing. However, practical realizations of path coverage have been limited to restricted forms due to path explosion. In this work, we introduce a path reduction algorithm that bounds loop iterations in execution paths, enabling a practical form of path coverage that preserves essential control-flow information. Despite this advancement, we find that path coverage performs no better than existing metrics such as edge coverage. To understand this phenomenon, we establish the concept of coverage complexity, a quantitative measure of the granularity of coverage metrics. Analogous to complexity classes and Big-O notation in algorithm analysis, coverage complexity classifies metrics into asymptotic complexity classes such as linear, polynomial, and exponential. This framework provides a structured overview of the entire space of coverage metrics and guides the design of new ones. Our complexity analysis and empirical evaluation on the MAGMA benchmark reveal a consistent pattern: metrics within the same complexity class tend to exhibit similar fuzzing performance, and linear-complexity metrics consistently outperform more complex ones. This suggests a simple but powerful principle: when designing a new coverage metric, the first step is to determine its complexity class, which serves as an early predictor of its potential performance. Since higher-complexity metrics consistently underperform, our results imply that the family of linear metrics may already represent the optimal frontier of coverage-guided fuzzing, offering, for the first time, a structured overview of the landscape of coverage metrics.
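The core trick above, bounding loop iterations so that paths differing only in trip counts collapse to one representative, can be sketched in a few lines. This is a toy version with hypothetical names and parameters; the thesis's algorithm operates on real control-flow paths and handles nesting more carefully.

```python
def reduce_path(path, bound=2, max_cycle=4):
    """Cap back-to-back repetitions of short cycles in an execution path.

    Toy sketch: scan for a window of length L repeating consecutively and
    keep at most `bound` copies, so paths differing only in loop trip
    counts reduce to the same path.
    """
    out, i, n = [], 0, len(path)
    while i < n:
        emitted = False
        for L in range(1, max_cycle + 1):
            if i + 2 * L > n:
                break
            cycle = path[i:i + L]
            reps = 1
            while path[i + reps * L:i + (reps + 1) * L] == cycle:
                reps += 1
            if reps > bound:
                out.extend(cycle * bound)  # keep only `bound` iterations
                i += reps * L
                emitted = True
                break
        if not emitted:
            out.append(path[i])
            i += 1
    return out

# two paths that differ only in loop trip count collapse together
assert reduce_path(list("abbbbbc")) == reduce_path(list("abbc")) == list("abbc")
```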
Pushing the Limit of Language-Agnostic Program Reduction (University of Waterloo, 2026-01-20). Xu, Zhenyang.

Program reduction is a widely used technique for testing and debugging language processors. Given a program that triggers a bug in a language processor, program reduction searches for a canonical and minimal program that triggers the same bug, thereby facilitating bug deduplication and simplifying debugging. Among various reduction approaches, language-agnostic reducers (AGRs) have emerged as an important class of techniques because they do not rely on language-specific knowledge and can thus be applied across a wide range of programming languages. This generality makes AGRs especially valuable for languages lacking specialized reduction tools. However, previous AGRs support only a limited set of program transformations, which restricts their minimization and canonicalization capability and results in a substantial performance gap compared to language-specific reducers (SPRs). This thesis aims to enhance both the canonicalization and minimization capabilities of AGRs, thereby narrowing that gap. It comprises three contributions.

The first work improves the reduction capability of AGRs by enabling them to integrate more transformations in an efficient way. As mentioned above, previous AGRs support only a limited set of transformations: once a 1-minimal result is obtained and no further transformation can reduce the program, the reduction process terminates, yet such a 1-minimal result may still contain excessive bug-irrelevant program elements. To address this limitation, this work proposes a framework named Vulcan. Vulcan employs an AGR as the main reducer and introduces a set of auxiliary reducers that perform diverse program transformations. When the main reducer can no longer make progress, Vulcan invokes one of its auxiliary reducers to create new reduction opportunities, and then re-applies the main reducer to further minimize the program. This work also presents three example program transformations: Identifier Replacement, Subtree Replacement, and Tree-Based Local Exhaustive Enumeration. Evaluation on a multilingual benchmark suite (referred to as Benchmark-Reduce), which includes C, Rust, and SMT-LIBv2 programs, demonstrates that Vulcan outperforms the state-of-the-art AGR, Perses, in terms of minimization: on average, Vulcan produces results with 33.55%, 21.61%, and 31.34% fewer tokens than Perses on the C, Rust, and SMT-LIBv2 benchmarks, respectively.

The second work focuses on enhancing the canonicalization capability of AGRs. A reducer with strong canonicalization capability can minimize differences among programs that trigger the same bug, greatly facilitating bug deduplication. However, prior AGRs exhibit poor canonicalization, primarily because they treat tokens as atomic and irreducible units. To address this limitation, this work proposes T-Rec, a fine-grained, lexical-syntax-guided program reduction technique that can effectively reduce and canonicalize each token in a program. Evaluation results show that integrating T-Rec into Vulcan eliminates 1,315 additional duplicates in a benchmark suite containing 3,796 programs that expose 46 unique bugs (referred to as Benchmark-Cano). Moreover, T-Rec further reduces the size of Vulcan's results on Benchmark-Reduce by up to 53.73% in terms of bytes.

The third work further enhances both the minimization and canonicalization performance of AGRs by introducing additional program transformations. Specifically, it proposes SFC, a novel syntax-guided transformation technique that has been overlooked by prior syntax-guided AGRs. To apply SFC effectively and efficiently in program reduction, three SFC-based reduction methods are designed: Smaller Structure Replacement, Identifier Elimination, and Structure Canonicalization. Evaluation results show that integrating these SFC-based methods into Vulcan yields an average 8.2% reduction in output size on Benchmark-Reduce. Moreover, when combined with T-Rec, the SFC-based methods enable Vulcan to eliminate an additional 435 duplicates in Benchmark-Cano.

Collectively, these studies significantly advance the effectiveness of language-agnostic program reduction in both minimization and canonicalization. By integrating the proposed approaches, the prior state-of-the-art AGR, Perses, can produce results that are on average 43% smaller on Benchmark-Reduce and eliminate 1,750 additional duplicates in Benchmark-Cano.
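Of the transformations named above, Identifier Replacement is the easiest to picture. Here is a toy Python sketch, with hypothetical names: map each distinct identifier to a short canonical name so that different programs triggering the same bug converge on the same form. A real reducer would re-run the bug oracle after every rewrite to confirm the bug still triggers.

```python
import re

def canonicalize_identifiers(tokens, keywords=frozenset({"int", "return", "if", "else"})):
    """Toy identifier canonicalization: replace each distinct identifier
    with the next canonical name (v0, v1, ...), leaving keywords and
    punctuation untouched. Illustrative only; the real transformation is
    syntax-guided and validated against the bug oracle."""
    mapping, out = {}, []
    for tok in tokens:
        if re.fullmatch(r"[A-Za-z_]\w*", tok) and tok not in keywords:
            mapping.setdefault(tok, f"v{len(mapping)}")
            out.append(mapping[tok])
        else:
            out.append(tok)
    return out

print(canonicalize_identifiers(["int", "total_count", "=", "offset", "+", "offset", ";"]))
# ['int', 'v0', '=', 'v1', '+', 'v1', ';']
```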
Contextual AI: Integrating Macro-Context with Transformer Architectures for Social Media Analysis, Federated Learning, and Recommender Systems (University of Waterloo, 2026-01-20). Hebert, Liam.

Context is crucial for understanding the world and making informed decisions. While existing transformer architectures excel at contextualizing information locally, such as other words in a sentence, they fail to factor in broader, macro-level contexts. We identify available yet underutilized macro contexts in three use cases: online discussions, federated learning, and recommender systems. For each, we motivate the need to leverage existing macro context and propose context-aware solutions based on the transformer architecture.

In online discussion boards, the rich conversational and multimodal macro context in which a comment is made is often overlooked. This is especially pertinent in hate speech detection. Classical solutions that examine individual comments in isolation fail to account for this context, leading to ambiguity and misinterpretation. For instance, the comment "Ew, that's gross!" has a different interpretation depending on whether it is in response to food or to sensitive issues like LGBTQ rights. Furthermore, images that accompany text can also provide crucial context. We propose mDT, a novel deep learning model architecture based on graph transformer networks, which incorporates this valuable context when evaluating the hatefulness of individual comments. Our experimental results demonstrate a 7% F1 improvement over existing baselines that do not utilize this context, and a 21% F1 improvement over previous graph-based methods.

Second, we tackle the context-agnostic paradigm of federated learning. The prevalent Federated Averaging (FedAvg) method statically averages model weights, failing to account for the crucial macro-level context of heterogeneous-agent environments and leading to a suboptimal, one-size-fits-all model. For example, autonomous driving agents exploring rural roads acquire different knowledge than those in urban settings, and this environmental context is lost in the process. We propose FedFormer, a novel federation strategy that leverages transformer attention to enable each agent to weigh and selectively incorporate insights from its peers in a context-dependent manner. In turn, FedFormer enables a more effective, efficient federation that respects and adapts to environmental diversity while preserving privacy. Our experiments across environments in MetaWorld, a set of heterogeneous robotic manipulation tasks, demonstrate improvements of 1.48x to 3.41x over FedAvg.

Finally, in recommender systems, the user's intent can provide critical personalization context. Simple approaches rely on collaborative filtering, which only models implicit (micro-level) user preferences by extrapolating from historical data. Our solution, Flare, proposes a contextual recommender system that empowers users to steer recommendations via explicit natural language queries (e.g., "Staplers", "Webcams"). Flare's architecture fuses collaborative filtering signals with semantic representations of both the user's explicit query and item descriptions, bridging the gap between long-term preferences and the context of the user's immediate goals. Our experiments using the Amazon Product Reviews datasets show a 1.7x and 2.53x increase in recall@1 and recall@10, respectively, compared to approaches that do not factor in user intent.
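To make the contrast with FedAvg's static mean concrete, here is a rough sketch in which cosine similarity between parameter vectors stands in for FedFormer's learned transformer attention (all names are hypothetical). Each agent mixes in peers from similar environments more heavily instead of averaging uniformly.

```python
import torch
import torch.nn.functional as F

def attention_federate(own: torch.Tensor, peers: torch.Tensor, temperature: float = 0.5):
    """Context-dependent federation sketch.

    own:   (d,)   this agent's flattened parameters
    peers: (n, d) flattened parameters of n peer agents
    Returns an attention-weighted mixture of peer parameters, replacing
    FedAvg's uniform mean. Cosine similarity is a stand-in for the
    learned attention FedFormer actually uses.
    """
    scores = (peers @ own) / (peers.norm(dim=1) * own.norm() + 1e-9)  # (n,)
    weights = F.softmax(scores / temperature, dim=0)
    return weights @ peers  # (d,)

own = torch.randn(128)
peers = torch.randn(5, 128)
updated = attention_federate(own, peers)  # this agent's federated update
```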
On the Generalizability of AI-Generated Text Detection (University of Waterloo, 2026-01-20). David, Amir.

As large language models (LLMs) become ubiquitous, reliably distinguishing their outputs from human writing is critical for academic integrity, content moderation, and preventing model collapse from synthetic training data. This thesis examines the generalizability of LLM-text detectors across evolving model families and domains. We compiled a comprehensive evaluation dataset from commonly used human corpora and generated corresponding samples using recent OpenAI and Anthropic models spanning multiple generations. Comparing the state-of-the-art zero-shot detector (Binoculars) against supervised RoBERTa/DeBERTa classifiers, we arrive at four main findings. First, zero-shot detection fails on newer models. Second, supervised detectors maintain high TPR in-distribution but exhibit asymmetric cross-generation transfer. Third, commonly reported metrics such as AUROC can obscure poor performance at deployment-relevant thresholds: detectors achieving high AUROC yield near-zero TPR at low FPR, and existing low-FPR evaluations often lack statistical reliability due to small sample sizes. Fourth, through tail-focused training and calibration, we reduce FPR by up to 4x (from ~1% to ~0.25%) while maintaining 90% TPR. Our results suggest that robust detection requires continually re-calibrated, model-aware pipelines rather than static universal detectors.
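The calibration step behind the fourth finding is easy to operationalize: fix the threshold on human-written text at the deployment FPR, then report TPR at that threshold. A minimal sketch (hypothetical function name; scores are assumed higher for LLM-generated text):

```python
import numpy as np

def tpr_at_fpr(human_scores, llm_scores, target_fpr=0.0025):
    """Pick the score threshold whose false-positive rate on human text
    is target_fpr, then measure the true-positive rate on LLM text."""
    human_scores = np.asarray(human_scores)
    llm_scores = np.asarray(llm_scores)
    thresh = np.quantile(human_scores, 1.0 - target_fpr)
    fpr = float(np.mean(human_scores >= thresh))  # should be ~target_fpr
    tpr = float(np.mean(llm_scores >= thresh))
    return thresh, fpr, tpr
```

Note that estimating FPR near 0.25% reliably requires thousands of human-written calibration samples, which is exactly the statistical-reliability concern the thesis raises about existing low-FPR evaluations.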
Time stepping methods for coupled fluid-rigid body simulation (University of Waterloo, 2026-01-19). Gurditta, Rikin.

Interaction between fluids and solid objects is ubiquitous in everyday life, yet the resulting motion is too intricate for visual effects artists and animators to realistically depict by hand. Instead, artists turn to computer graphics applications that employ physics-based animation to simulate these complex phenomena. Some of these applications solve the incompressible Euler equations coupled with the rigid body equations to compute the motion of an incompressible fluid interacting with undeformable solids. Of particular interest is two-way coupling, in which the fluid and solids both affect each other's motion. Many methods have been developed to improve the realism of fluid simulations, allowing them to simulate more compelling scenarios. There are several time stepping schemes for fluid simulation in the literature, presenting ways to evolve the motion of the fluid over time that may generate more energetic or more accurate results. In particular, we focus on the BDF2 and Advection-Reflection families of schemes due to their accuracy and their improved ability to preserve the kinetic energy of the fluid. Our goal in this thesis is to extend these time stepping schemes to two-way coupled fluid-rigid body simulation, to yield more compelling simulations of the interactions between these two types of materials. We catalogue some of the popular time stepping schemes for fluid simulation and explain their relations to methods for solving ordinary differential equations. Then, taking as our starting point the popular method of Batty et al., we re-derive the time stepping scheme originally proposed for coupled systems, and derive new schemes for coupled systems corresponding to the previously discussed fluid schemes, comparing along the way to the coupled time stepping scheme proposed by Gibou and Min. We measure the accuracy, energy-preservation, and computational cost properties of each scheme implemented within a 2D simulation, presenting quantitative and qualitative results. We hope our work encourages further investigation into the theoretical basis as well as the qualitative properties of coupled fluid-rigid body simulation.
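For reference, the standard BDF2 formula for an ODE y' = f(t, y) is the template such schemes instantiate, with f replaced by the discrete advection, pressure, and coupling terms:

```latex
y^{n+1} \;=\; \tfrac{4}{3}\,y^{n} \;-\; \tfrac{1}{3}\,y^{n-1}
\;+\; \tfrac{2}{3}\,\Delta t\, f\!\left(t^{n+1},\, y^{n+1}\right)
```

Because the right-hand side is evaluated at the new time level, each step is implicit, which is part of what makes carrying BDF2 over to two-way coupled fluid-rigid solves non-trivial.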
New Methods for Analyzing the Properties of Automatic Sequences (University of Waterloo, 2026-01-19). Khodier, Mazen.

Automatic sequences and morphic words lie at the intersection of automata theory, logic, and combinatorics on words. Many of their structural properties can be formulated as logical predicates over integer representations and decided using automata. This thesis presents automata-based methods for efficiently constructing and verifying deterministic finite automata corresponding to such predicates, and builds on this foundation to analyze key combinatorial properties of morphic words, including the critical exponent and subword complexity.

In the first part of this thesis, Chapters 2 to 4, we introduce the notion of self-verifying predicates: logical predicates capable of verifying their own correctness. We show how this property enables verification of candidate automata through a small set of inductive conditions and allows the corresponding automata to be constructed deterministically rather than through heuristic guessing. Building on Angluin's L* learning algorithm, we demonstrate that for such predicates, the associated minimal automata can be generated in time polynomial in the size of both the automaton for the underlying sequence and the resulting automaton, thereby avoiding the potentially extremely large intermediate automata that sometimes arise in Walnut. In particular, we give effective constructions for the equality-of-factors predicate, which is used extensively in the second half of the thesis, as well as for other self-verifying predicates, including periodicity of factors, addition relations for numeration systems, and summation of synchronized sequences.

The second part, Chapters 5 to 7, applies the previously constructed equality-of-factors predicate to investigate two central combinatorial measures of infinite words: the critical exponent and the subword complexity. Although binary 3-uniform morphisms are used as illustrative examples, the methods generalize naturally to all binary uniform morphisms. For the critical exponent, we present a decision procedure implemented in Walnut that detects whether the exponent is infinite and computes its exact rational value when finite. For subword complexity, we propose two complementary approaches: a constructive method that combines established concepts to produce exact formulas for ρ(n), and a fully deterministic procedure that implements Frid's approach using Walnut. The new results include explicit subword-complexity formulas for twelve morphisms and critical-exponent values for ten morphisms. All algorithms and implementations developed in this thesis are publicly available as open-source code in the GitHub repository Cashew, to support and facilitate further research in combinatorics on words and automata theory.
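For readers unfamiliar with the quantity being decided: a finite factor u with smallest period p has exponent |u|/p (for example, "abab" has exponent 2 and "ababa" has exponent 5/2), and the critical exponent of an infinite word w is the supremum over all its factors:

```latex
E(w) \;=\; \sup\left\{ \frac{|u|}{p} \;:\; u \text{ is a finite factor of } w
\text{ with smallest period } p \right\}
```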
Grounded or Guessing? An Empirical Evaluation of LLM Reasoning in Agentic Workflows for Root Cause Analysis in Cloud-based Systems (University of Waterloo, 2026-01-16). Riddell, Evelien.

Root cause analysis (RCA) is essential for diagnosing failures within complex software systems to ensure system reliability. The highly distributed and interdependent nature of modern cloud-based systems often complicates RCA efforts, particularly for multi-hop fault propagation, where symptoms appear far from their true causes. Recent advancements in Large Language Models (LLMs) present new opportunities to enhance automated RCA. In particular, LLM-based agents offer autonomous execution and dynamic adaptability with minimal human intervention. However, their practical value for RCA depends on the fidelity of their reasoning and decision-making. Existing work relies on historical incident corpora, operates directly on high-volume telemetry beyond current LLM capacity, or embeds reasoning inside complex multi-agent pipelines, conditions that obscure whether failures arise from reasoning itself or from peripheral design choices. In this thesis, we present a focused empirical evaluation that isolates an LLM's reasoning behaviour. We design a controlled experimental framework that foregrounds the LLM by using a simplified experimental setting. We evaluate six LLMs under two agentic workflows (ReAct and Plan-and-Execute) and a non-agentic baseline on two real-world case studies (GAIA and OpenRCA). In total, we executed 48,000 simulated failure scenarios, totalling 228 days of execution time. We measure both root-cause accuracy and the quality of intermediate reasoning traces. We produce a labelled taxonomy of 16 common RCA reasoning failures and use an LLM-as-a-Judge for annotation. Our results clarify where current open-source LLMs succeed and fail in multi-hop RCA, quantify sensitivity to input data modalities, and identify reasoning failures that predict final correctness. Together, these contributions provide transparent and reproducible empirical results and a failure taxonomy to guide future work on reasoning-driven system diagnosis.

UringCL: A Lightweight io_uring Convergence Layer for Adoption in Legacy Event Loops (University of Waterloo, 2026-01-16). Afsharian, Armin.

High-performance network servers depend on efficient I/O mechanisms to manage thousands of concurrent connections with minimal latency and overhead. While traditional readiness-based interfaces (e.g., select, poll, epoll) notify applications when I/O operations can proceed, they still require synchronous system calls to execute the operations. This synchronous requirement causes frequent user-kernel transitions, which limits scalability under heavy load. In contrast, the io_uring interface offers a fundamentally different approach by providing a completion-based I/O model that minimizes system-call overhead and enables true asynchronous data transfer. Although the performance benefits of io_uring are well established in storage systems, its integration into high-throughput network applications remains limited. This thesis aims to bridge this integration gap by making the adoption of io_uring accessible and providing a structured vehicle for evaluating its performance in network-bound environments. To this end, the io_uring Convergence Layer (UringCL) is presented to transparently map the synchronous I/O calls of readiness-driven applications onto asynchronous io_uring operations. UringCL simplifies initialization, event handling, and data transfer while preserving the existing control flow of legacy applications, allowing for incremental migration toward completion-based I/O without major redesign. The UringCL architecture facilitates the practical integration of io_uring into established network architectures and provides a consistent framework for measuring its impact on throughput, latency, and CPU efficiency. Experimental results demonstrate significant performance advantages over traditional models. In bulk-transfer workloads, the system delivers up to 40% higher throughput than epoll due to superior batching capabilities. In request-response scenarios involving Memcached, the integration achieves higher peak throughput and maintains significantly lower and more stable tail latency under heavy load. Furthermore, UringCL achieves these benefits with negligible overhead, proving that completion-based I/O can be adopted seamlessly to enhance the efficiency of modern network servers.
GASTON: Graph-Aware Social Transformer for Online Networks (University of Waterloo, 2026-01-16). Wloch, Olha.

Online communities have become essential digital third places for socialization and support, yet they also harbour toxicity, echo chambers, and misinformation. Mitigating these harms requires computational models that can understand the nuance of online interactions to accurately detect harmful content such as toxicity and norm violations. This is difficult because the meaning of an individual post is rarely self-contained; it is dynamically constructed through the interplay of what is written (textual content) and where it is posted (social structure). We require models that effectively fuse these two signals to generate representations for online entities such as posts, users, and communities. Current approaches often treat these signals in isolation: text-only models analyze content but miss the local social norms that define acceptable behaviour, while structure-only models map relationships but ignore the semantic content of discussions. Recent hybrid approaches attempt to bridge this gap, but some rely on simple text-averaging mechanisms to represent a user or a community, and in doing so flatten their rich, norm-defining identity. To address this limitation, this thesis proposes GASTON (Graph-Aware Social Transformer for Online Networks), a graph learning framework designed to capture the essence of online social networks. It does so by modeling connections between all online entities, such as users, communities, and text. This makes it possible to ground user and text representations in their local norms, providing the necessary context to accurately classify behaviour in downstream tasks. The heart of our solution is a contrastive initialization strategy that pre-trains community representations based on user membership patterns, effectively capturing the unique signature of a community's user base before the model processes any text. This allows GASTON to distinguish between communities (e.g., a support group vs. a hate group) based on who interacts there, even if they share similar vocabulary. We evaluate GASTON across a diverse set of socially aware downstream tasks, including mental health stress detection, toxicity scoring, and norm violation detection. Our experiments demonstrate that GASTON outperforms state-of-the-art baselines, particularly in tasks where social context is critical for classification, such as detecting norm violations. Furthermore, we illustrate that these learned representations provide interpretable insights, offering a path toward user-empowered transparency in online spaces.
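A sketch of what contrastive initialization from membership patterns could look like (hypothetical sizes, names, and objective; GASTON's actual training setup may differ): in-batch InfoNCE pulls a community's embedding toward its members' embeddings and away from other users, before any text is processed.

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: 100 communities, 1000 users, embedding dim 64.
C, U, d = 100, 1000, 64
comm_emb = torch.nn.Embedding(C, d)
user_emb = torch.nn.Embedding(U, d)
opt = torch.optim.Adam(list(comm_emb.parameters()) + list(user_emb.parameters()), lr=1e-3)

def info_nce(batch_comms, batch_users, temperature=0.1):
    """In-batch InfoNCE: row i's positive is its own member (column i);
    the other rows' users serve as negatives. Duplicate communities in a
    batch would add label noise; ignored in this toy version."""
    c = F.normalize(comm_emb(batch_comms), dim=-1)
    u = F.normalize(user_emb(batch_users), dim=-1)
    logits = (c @ u.T) / temperature
    targets = torch.arange(len(batch_comms))
    return F.cross_entropy(logits, targets)

# one illustrative step over hypothetical (community, member) pairs
comms = torch.randint(0, C, (256,))
users = torch.randint(0, U, (256,))
loss = info_nce(comms, users)
loss.backward(); opt.step(); opt.zero_grad()
```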
Demystifying Foreground-Background Memorization in Diffusion Models (University of Waterloo, 2026-01-07). Di, Jimmy Z.

Diffusion models (DMs) memorize training images and can reproduce near-duplicates during generation. Current detection methods identify verbatim memorization but fail to capture two critical aspects: quantifying partial memorization occurring in small image regions, and memorization patterns beyond specific prompt-image pairs. To address these limitations, we propose Foreground Background Memorization (FB-Mem), a novel segmentation-based metric that classifies and quantifies memorized regions within generated images. Our method reveals that memorization is more pervasive than previously understood: (1) individual generations from single prompts may be linked to clusters of similar training images, revealing complex memorization patterns that extend beyond one-to-one correspondences; and (2) existing model-level mitigation methods, such as neuron deactivation and pruning, fail to eliminate local memorization, which persists particularly in foreground regions. Our work establishes an effective framework for measuring memorization in diffusion models, demonstrates the inadequacy of current mitigation approaches, and proposes a stronger mitigation method based on clustering.
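To make the foreground/background split concrete, here is a toy region-wise check (hypothetical names; FB-Mem itself compares learned embeddings over segmented regions rather than raw pixels as done here):

```python
import numpy as np

def region_memorization(gen, train, fg_mask, thresh=0.95):
    """Score foreground and background regions separately.

    gen, train: float arrays of shape (H, W, 3) in [0, 1]
    fg_mask:    boolean array of shape (H, W), True on the foreground
    Scoring each region on its own means memorization confined to a
    small foreground is not averaged away by a dissimilar background.
    """
    def cos(a, b):
        a, b = a.ravel(), b.ravel()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    fg = cos(gen[fg_mask], train[fg_mask])
    bg = cos(gen[~fg_mask], train[~fg_mask])
    return {"foreground_memorized": fg > thresh, "background_memorized": bg > thresh}
```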
Cues, Clones, and Cars: Access Control Issues in Customized Android (University of Waterloo, 2025-12-23). Vyas, Parjanya.

Android's open-source design and extensive customization have fueled its dominance across smartphones, automotive systems, wearables, and other domains. This flexibility, however, introduces serious security challenges, particularly in the enforcement of access control. Prior research has investigated inconsistencies within the framework, across layers, and across Android versions, yet important gaps remain, especially in detecting vendor-introduced data-driven customizations, replicated APIs, and platform-specific adaptations (e.g., automotive) that are difficult to capture with existing techniques. This dissertation investigates how Android contextual features can be systematically leveraged to uncover access control vulnerabilities that evade prior analyses. It presents four main contributions:
- Bluebird: a probabilistic inference framework that derives access control requirements from application-side sensitivity indicators (UI cues and app-side access control). By fusing NLP-driven signals with static analysis, Bluebird identifies APIs whose protections do not match their implied sensitivity. Applied to 14 ROMs, Bluebird flagged 391 likely under-protected private APIs and supported 11 proof-of-concept exploits.
- Ariadne: a static-analysis technique built around a novel access control dependency graph abstraction that models explicit and inferred access control relationships among framework data holders. Ariadne detects inconsistencies introduced by data-driven vendor customizations that traditional tools miss. Evaluated on AOSP and vendor ROMs, it discovered 30 unique inconsistencies and enabled 13 proof-of-concept exploits.
- RepFinder: a large-scale measurement pipeline that identifies duplicated or "Replica" APIs created via copy-paste editing and evaluates their access control enforcement. Analyzing 342 ROMs from 10 vendors, RepFinder found replication to be widespread (~141 Replicas per ROM on average) and that a significant fraction of Replicas (37% on average) are under-protected.
- AutoAcRaptor: a platform-specific static analysis framework for Android Automotive OS (AAOS) that identifies automotive entry points and evaluates both access control and feature-check enforcement. Applied to 10 AAOS ROMs, AutoAcRaptor reported an average of 23 auto-feature and access control anomalies per ROM.
Collectively, these contributions show that Android contextual features such as app-side sensitivity indicators, framework data holders, and platform-specific service registrations can be systematically harnessed to reveal overlooked access control vulnerabilities. They also demonstrate that techniques for identifying framework customization-induced vulnerabilities can be adapted to emerging Android-based platforms such as Android Automotive OS by accounting for platform-specific differences. Beyond these immediate contributions, this work opens two broader research directions. First, the contextual features explored in this work may not be exhaustive; future research should aim to identify additional contextual signals, potentially through automated discovery, and explore an integration framework that makes it easy to incorporate new analyses into a unified solution. Second, the adaptation of these techniques to other Android-based platforms remains an open challenge. While AutoAcRaptor demonstrates feasibility for Android Automotive, other platforms such as Android TV, Wear OS, and Android XR present unique differences that require dedicated investigation to determine how well these methods generalize and what extensions are needed.

Pragmatica: A VR Tool for Autonomous Practice During Language Therapy (University of Waterloo, 2025-12-23). Prasad, Karthik.

Aphasia is a communication disorder that affects millions worldwide, but those affected have limited access to in-person therapy. They compensate with at-home practice, but existing tools are either ineffective or require a clinician to be present. We present Pragmatica, a VR platform that enables people with aphasia to practice their communication skills independently at home through immersive, context-rich activities. In an eight-week case study, we compared Pragmatica with traditional therapy (four participants per group). With no detected difference in Quick Aphasia Battery (QAB) scores, VR participants engaged in substantial practice (31 hours, 366 activities) and described the VR experience as engaging, fun, and motivating, though they found the variety of relevant and unique activities limited. Our study contributes empirical evidence of VR's feasibility for autonomous language practice, as well as design insights and considerations for accessible, aphasia-friendly VR systems (flexible controls, multi-modal prompts and inputs).

Exploring Voice Agent Gender in a Running Coach Application (University of Waterloo, 2025-12-22). O'Neill, Casey.

In daily life, people commonly use voice agents such as Siri and Alexa to perform everyday tasks. With rapid technological advancements, voice agents are becoming more humanlike, and there is growing interest in using them for high-stakes tasks such as mental health advice and therapy. However, it is important to understand the potential harms posed by these voice agents of the future. This work focuses on gender-based social justice problems introduced by the widespread use of humanlike gendered voice agents, many of which disproportionately affect women due to the high prevalence of commercial voice agents that are female or "female by default". To contribute to an understanding of how to design gendered voice agents that reduce these potential harms, we explore reactions to a gendered voice agent running coach. We build a voice agent running coach system for smartphones that includes three voice options (male, female, gender-ambiguous), which we validate through an in-person survey study (n = 30). We use our system in a field study (n = 18) in which participants run with the agent for three weeks and attend two in-person or online sessions with the researcher. We present a statistical analysis of survey data and key themes from a reflexive thematic analysis of interview data from the study. We conclude with a discussion of actions designers can take to create gendered voice agents more responsibly, and offer recommendations for future research into gender and gender-based stereotyping in voice agents.
Parallel Oblivious Joins using Radix Partitioning (University of Waterloo, 2025-12-16). Ahmed, Nafis.

We present parallel doubly oblivious algorithms for both non-foreign-key and foreign-key joins using an oblivious radix partitioning technique. Oblivious query processing enables secure execution over encrypted data when organizations outsource data to the cloud. When the cloud server processes encrypted data within hardware enclaves, the data is vulnerable to side-channel leaks caused by data-dependent memory access patterns and control flow. Our algorithms efficiently defend against these vulnerabilities by combining data partitioning with parallel execution. Specifically, we propose a doubly oblivious radix partitioning approach that divides input arrays into disjoint partitions without leaking the multiplicity of individual elements, unlike vanilla radix partitioning. This is especially important for join operations, where duplicate keys are common. To construct our join algorithm, we apply oblivious radix partitioning independently to each input table, allowing the algorithm to compare tuples only within corresponding partitions. When input tables are presorted, our oblivious join algorithm is the first to avoid combining and obliviously resorting them, yielding performance improvements over the state-of-the-art scheme, Obliviator. Beyond joins, our oblivious radix partitioning technique is a standalone primitive with applications to a broad class of problems, including oblivious aggregation and private set intersection.
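One intuition for why vanilla radix partitioning leaks: observed bucket sizes reveal key multiplicities. The toy Python sketch below shows only the padding idea, writing every bucket at a fixed worst-case size with dummy slots (hypothetical names; the thesis's doubly oblivious construction also makes the memory access pattern itself data-independent, which this sketch does not):

```python
import numpy as np

def toy_padded_radix_partition(keys, bits=2, pad_to=None):
    """Partition keys by their low-order radix bits into buckets that are
    all padded to the same worst-case size with dummy slots (key = -1),
    so the sizes an observer sees are independent of key multiplicities."""
    n_buckets = 1 << bits
    pad_to = pad_to or len(keys)  # worst case: every key in one bucket
    buckets = np.full((n_buckets, pad_to), -1, dtype=np.int64)
    fill = np.zeros(n_buckets, dtype=np.int64)
    for k in keys:
        b = k & (n_buckets - 1)      # bucket index from low-order bits
        buckets[b, fill[b]] = k      # data-dependent write: NOT oblivious
        fill[b] += 1
    return buckets  # shape (n_buckets, pad_to), independent of the data
```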
Algebraic geometric methods for algorithms in satisfiability, irreducibility of varieties, and identity testing (University of Waterloo, 2025-12-16). Garg, Abhibhav.

In this thesis we study three problems that lie at the intersection of abstract algebra and theoretical computer science. The first is the polynomial identity testing problem: the task of determining whether an algebraic circuit computes the identically zero polynomial. We give the first polynomial-time deterministic algorithm for the special case of depth-four algebraic circuits with top fan-in three and constant bottom fan-in. We also give the first such algorithm for circuits with bottom fan-in two and constant top fan-in. Our methods involve studying higher-degree generalisations of classical incidence configurations known as Sylvester–Gallai configurations. The second is the problem of checking whether a system of equations is satisfiable. In the regime where the number of variables in the system is a constant, we show that satisfiability can be checked in constant depth by algebraic circuits. In particular, we show that the multivariate resultant has a constant-depth circuit in this regime, independent of the degrees; the previous best known constructions of the resultant required depth logarithmic in the degrees. The final problem we consider is deciding whether an ideal-theoretically defined variety is irreducible in characteristic 0. We show that this task can be solved in the polynomial hierarchy assuming the generalized Riemann hypothesis, improving the previous best known bound of PSPACE.

Towards Safe Initialization of Scala Global Objects (University of Waterloo, 2025-12-16). Xing, Enze.

This thesis focuses on safe initialization of global objects in Scala. Global objects encapsulate global information in Scala, and their initialization is susceptible to causing run-time errors. Moreover, global objects are initialized on demand (i.e., on their first access). The initialization safety of a global object is brittle if it depends on the object's initialization point, because that point is the first access in the entire program. This motivates the idea of automatically detecting potential initialization errors during compilation. The main contribution of this thesis is designing and implementing a global object initialization checker in the Scala compiler. Theoretically, we identified run-time errors caused by unsafe initialization patterns of global objects and organized three static principles to enforce on Scala programs: prohibiting accesses to uninitialized fields, which prevents null pointer exceptions; partial ordering of global object initialization, which prevents deadlocks between the locks that guard the initialization of global objects; and initialization-time irrelevance, which ensures that the initialization safety of a global object is independent of its initialization point. We then designed the global object initialization checker by proposing formal initialization semantics for a Scala initialization calculus; the checker is presented as an abstract interpreter of this calculus. The initialization checker checks the initialization process of each global object individually rather than conducting a whole-program analysis. Practically, we have integrated the abstract interpreter into the Scala compiler after extending the initialization semantics with more Scala features. The initialization checker can be enabled when compiling Scala programs, and we evaluated it during the compilation of a test suite of widely used open-source Scala projects. The checker reports warnings in several projects that were verified to be true positives. These results highlight the need to check the initialization safety of Scala projects and the utility of the global object initialization checker developed in this thesis.

A Framework for Explaining LLM Reasoning with Knowledge Graphs (University of Waterloo, 2025-12-10). Shirdel, Moein.

Large Language Models (LLMs) have demonstrated remarkable question-answering (QA) capabilities, yet their decision processes and outputs often remain opaque and prone to factual inconsistencies. While existing methods evaluate or ground LLM outputs after generation, they typically lack mechanisms for aligning LLM reasoning with external knowledge sources. This thesis introduces AprèsCoT, a lightweight, model-agnostic framework that validates LLM reasoning by grounding it in an external knowledge graph (KG). AprèsCoT operates through three main components: Subgraph Retrieval, which extracts a KG subgraph relevant to a given query; Triple Extraction and Parsing, which converts the LLM's output into factual triples; and Matching, which aligns these triples with entities and relations in the extracted KG subgraph. The integration of these modules enables alignment between LLM reasoning and structured knowledge, producing traceable and structured explanations alongside model outputs. We evaluate alternative retrieval and matching strategies, analyze their trade-offs, and demonstrate how AprèsCoT helps users surface reasoning gaps, hallucinations, and missing facts. Experiments across multiple domains, including large-scale KGs, highlight AprèsCoT's effectiveness in advancing trustworthy and explainable AI.
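The Matching stage is simple to illustrate. Below is a toy version using exact matching on triples (hypothetical names and data; AprèsCoT's matcher aligns entities and relations against the subgraph rather than requiring exact string equality):

```python
def match_triples(extracted, subgraph):
    """Split (subject, relation, object) triples extracted from the LLM's
    answer into those supported by the retrieved KG subgraph and those
    that are not, flagging the latter as potential hallucinations."""
    supported, unsupported = [], []
    for triple in extracted:
        (supported if triple in subgraph else unsupported).append(triple)
    return supported, unsupported

kg = {("Waterloo", "locatedIn", "Ontario"), ("Ontario", "locatedIn", "Canada")}
answer_triples = [("Waterloo", "locatedIn", "Ontario"),
                  ("Waterloo", "capitalOf", "Canada")]
print(match_triples(answer_triples, kg))
# ([('Waterloo', 'locatedIn', 'Ontario')], [('Waterloo', 'capitalOf', 'Canada')])
```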
Democratizing and Modernizing Information Access: From Open Rerankers to Scalable RAG Evaluation (University of Waterloo, 2025-12-09). Pradeep, Ronak.

Modern information access increasingly relies on complex pipelines involving large language models (LLMs), fundamentally changing how users interact with information, from sophisticated multi-stage retrieval pipelines to end-to-end retrieval-augmented generation (RAG) systems. While these advancements enhance user experience, they also introduce significant challenges. The research community's growing reliance on proprietary, black-box models for key tasks like document reranking creates barriers to innovation and reproducibility (the Component Challenge). Furthermore, progress is hampered by the lack of a shared, standardized ecosystem for executing and measuring information access systems (the Benchmarking Challenge). Finally, the generative nature of RAG systems makes them fundamentally harder to evaluate than traditional systems that return document lists; new methodologies are required to assess factual accuracy and completeness in a reliable, scalable manner (the Evaluation Challenge). We argue that progress depends on the synergistic development of open, high-effectiveness system components and the reliable, scalable evaluation frameworks necessary to assess them.

This thesis addresses these challenges through a narrative arc that begins with pushing existing paradigms to their limits. Given that frontier models dominate today's landscape, we find a pressing need for new open-source solutions. We begin by analyzing the dominant supervised ranking paradigm, developing multi-stage pipelines that demonstrated high effectiveness but also highlighted inherent complexity and cost. Subsequently, we conducted a systematic exploration of model backbones, loss functions, and negative mining strategies to squeeze effectiveness gains from supervised pointwise cross-encoders. Next, we continue with a large-scale empirical study of the newly evolving generative retrieval paradigm, which revealed its scalability limitations on large, real-world collections. This portion culminates in the final contribution to the Component Challenge: RankZephyr, an open-source 7B-parameter listwise reranker. By leveraging a carefully designed instruction distillation curriculum, RankZephyr matches and often surpasses the effectiveness of much larger proprietary models like GPT-4. It provides the community with a powerful, transparent, and accessible zero-shot reranking module, breaking the dependence on black-box systems for this critical task. All methods described have seen broad community adoption, and our models and evaluation frameworks continue to support ongoing research across open-source IR and RAG development.

With powerful open components in hand, the focus shifts to benchmarking. To address the Benchmarking Challenge, this work introduces Ragnarök, a reusable, end-to-end RAG framework designed to standardize how retrieval-augmented generation systems are constructed and assessed. Serving as the backbone for the TREC 2024 Retrieval-Augmented Generation Track, Ragnarök provides the research community with a shared experimental platform, critical data resources, and reproducible, effective baselines. By encapsulating the full RAG pipeline (from retrieval and grounding to generation and scoring) within a single, transparent framework, the TREC 2024 Retrieval-Augmented Generation Track and Ragnarök enable reproducible experimentation at scale. This not only ensures fair comparisons across diverse approaches but also establishes a foundation for cumulative progress in open-domain information access research, where previously ad hoc and non-replicable setups have often impeded reliable evaluation.

Building on this infrastructure, the thesis then directly tackles the Evaluation Challenge by introducing the AutoNuggetizer framework, which refactors the classic and well-studied nugget-based evaluation methodology for the modern era of LLMs. By automating the evaluation of information-nugget recall in RAG responses and validating the approach at scale in the TREC 2024 Retrieval-Augmented Generation Track, this work provides a reliable and scalable methodology for measuring the quality of generative information access systems.

In summary, this thesis contributes to the field of information access by exploring the limits of existing retrieval and ranking paradigms, developing powerful open-source components for modern information access systems, and creating the frameworks and methodologies required to benchmark and evaluate them.
The contributions include a comprehensive analysis of supervised ranking and generative retrieval paradigms, an open-source state-of-the-art listwise reranker (RankZephyr), a standardized framework for RAG benchmarking (Ragnarök), and a scalable methodology for evaluating generative systems (AutoNuggetizer). Together, these address the three core challenges identified at the outset, providing the community with both the tools to build effective systems and the methodologies to assess them rigorously. The widespread adoption of these artifacts by researchers and practitioners already underscores their tangible impact and utility in driving the field forward. In the future, on the reranking front, we would like to build faster, more efficient rerankers that can reason over texts and generalize across domains. On the benchmarking front, we will expand tasks to capture "deep research" information needs that demand multi-hop reasoning and query decomposition. On the evaluation front, we hope to extend the AutoNuggetizer methodology to tasks beyond web retrieval, into domains such as biomedical texts and conversational question answering.
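As a sketch of the nugget idea (illustrative labels and weights only, not the track's exact rubric): each information nugget extracted for a topic is judged as supported, partially supported, or not supported by the RAG response, and recall is the normalized sum of those judgments.

```python
def nugget_recall(judgments):
    """judgments: dict mapping each nugget string to a support label for
    one RAG response. Returns a recall-style score in [0, 1]."""
    weight = {"support": 1.0, "partial_support": 0.5, "not_support": 0.0}
    return sum(weight[label] for label in judgments.values()) / len(judgments)

print(nugget_recall({
    "UWaterloo is in Ontario": "support",
    "founded in 1957": "partial_support",
    "hosts the Cheriton School of CS": "not_support",
}))  # 0.5
```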