Computer Science

This is the collection for the University of Waterloo's Cheriton School of Computer Science.

Research outputs are organized by type (e.g., Master's Thesis, Article, Conference Paper).

Waterloo faculty, students, and staff can contact us or visit the UWSpace guide to learn more about depositing their research.

Recent Submissions

Now showing 1–20 of 1576
  • Item
    A Comparison of Unsupervised Topic Modelling Techniques for Qualitative Data Analysis of Online Communities
    (University of Waterloo, 2024-07-25) Kaur, Amandeep
    Social media constitutes a rich and influential source of information for qualitative researchers. However, its vast volume and diversity present significant challenges that computational techniques like topic modelling can help address. Yet qualitative researchers often struggle to use computational techniques due to a lack of programming expertise and concerns about preserving the nuanced aspects of their research, such as contextual understanding, subjective interpretation, and the ethical handling of their data. To address this issue, this thesis explores the integration of BERTopic, an advanced Large Language Model (LLM)-based method, into the Computational Thematic Analysis (CTA) Toolkit to support qualitative data analysis of social media. We conducted interviews and hands-on evaluations in which qualitative researchers compared topics from three modelling techniques: LDA, NMF, and BERTopic. Participants prioritized topic relevance, logical organization, and the capacity to reveal unexpected relationships within the data, valuing detailed, coherent clusters for deeper understanding and actionable insights. BERTopic was favored by 8 of 12 participants for its ability to uncover hidden connections. These findings underscore the transformative potential of LLM-based tools in providing deeper, more nuanced insights for qualitative analysis of social media data.
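    As a rough illustration of the comparison pipeline behind this work, the sketch below fits LDA and NMF with scikit-learn; BERTopic follows the same fit-and-inspect pattern via its own package. The corpus and parameter values are placeholders, not the thesis's actual setup.

        from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
        from sklearn.decomposition import LatentDirichletAllocation, NMF

        docs = ["users share tips about knitting patterns",
                "moderators discuss community guidelines and rules",
                "posts compare yarn brands and needle sizes"]  # placeholder corpus

        # LDA is fit on raw term counts; NMF is conventionally fit on tf-idf.
        cv = CountVectorizer(stop_words="english").fit(docs)
        tf = TfidfVectorizer(stop_words="english").fit(docs)
        lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(cv.transform(docs))
        nmf = NMF(n_components=2, random_state=0).fit(tf.transform(docs))

        def top_words(model, names, k=5):
            # Highest-weight terms per topic, for qualitative inspection.
            return [[names[i] for i in comp.argsort()[-k:]] for comp in model.components_]

        print(top_words(lda, cv.get_feature_names_out()))
        print(top_words(nmf, tf.get_feature_names_out()))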
  • Item
    Efficient Memory Allocator for Restricting Use-After-Free Exploitations
    (University of Waterloo, 2024-07-17) Wang, Ruizhe
    Attacks on heap memory, encompassing memory overflow, double and invalid free, use-after-free (UAF), and various heap-spraying techniques, are ever-increasing. Existing secure memory allocators can be broadly classified as complete UAF-mitigating allocators that focus on detecting and stopping UAF attacks, type-based allocators that limit type confusion, and entropy-based allocators that provide statistical defenses against virtually all of these attack vectors. In this thesis, I introduce two novel approaches, SEMalloc and S2Malloc, for type- and entropy-based allocation, respectively. Both allocators are designed to restrict, though not fully eliminate, the attacker's capabilities through allocation strategies, and they can significantly raise the security level without introducing excessive overhead. SEMalloc proposes a new notion of thread-, context-, and flow-sensitive 'type', SemaType, to capture allocation semantics, and prototypes a SemaType-based allocator that aims for the best trade-off amongst the impossible trinity. In SEMalloc, only heap objects allocated from the same call site and via the same function call stack can possibly share a virtual memory address, which effectively stops type-confusion attacks and makes UAF vulnerabilities harder to exploit. S2Malloc aims to enhance UAF-attempt detection without compromising other security guarantees or introducing significant overhead. It uses three innovative constructs in secure allocator design: free block canaries (FBC) to detect UAF attempts, random in-block offsets (RIO) to stop the attacker from accurately overwriting the victim object, and random bag layout (RBL) to impede attackers from estimating a block's size based on its address. This thesis demonstrates the importance of memory security and highlights the potential of more secure and efficient memory allocation that constrains attacker actions.
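    The free-block-canary construct can be illustrated outside a real allocator. The toy sketch below is purely conceptual and is not the S2Malloc implementation: freed blocks are filled with a random canary, and any later change to that canary is treated as evidence of a write through a dangling pointer.

        import os

        BLOCK = 32
        heap = {}       # block_id -> bytearray, a stand-in for real heap memory
        canaries = {}   # block_id -> expected canary bytes while the block is free

        def free(block_id):
            canary = os.urandom(BLOCK)   # random, so an attacker cannot forge it
            heap[block_id][:] = canary
            canaries[block_id] = canary

        def malloc(block_id):
            # On reuse, a changed canary means someone wrote into the freed block.
            if block_id in canaries and bytes(heap[block_id]) != canaries[block_id]:
                raise RuntimeError("use-after-free write detected")
            canaries.pop(block_id, None)
            return heap.setdefault(block_id, bytearray(BLOCK))

        buf = malloc(1)
        free(1)
        buf[0] ^= 0xFF   # dangling-pointer write into the freed block
        malloc(1)        # raises: the canary no longer matches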
  • Item
    Technology Design Recommendations Informed by Observations of Videos of Popular Musicians Teaching and Learning Songs by Ear
    (University of Waterloo, 2024-07-11) Liscio, Christopher
    Instrumentalists who play popular music often learn songs by ear, using recordings in lieu of sheet music or tablature. This practice was made possible by technology that allows musicians to control playback events. Until now, researchers have not studied the human-recording interactions of musicians attempting to learn pop songs by ear. Through a pair of studies analyzing the content of online videos from YouTube, we generate hypotheses and seek a better understanding of by-ear learning from a recording. Combined with results from neuroscience studies of tonal working memory and aural imagery, our findings reveal a model of by-ear learning that highlights note-finding as a core activity. Using what we learned, we discuss opportunities for designers to create a set of novel human-recording interactions, and to provide assistive technology for those who lack the baseline skills to engage in the foundational note-finding activity.
  • Item
    Quantum Query Complexity of Hypergraph Search Problems
    (University of Waterloo, 2024-07-09) Yu, Zhiying
    In the study of quantum query complexity, it is natural to consider the problems of finding triangles and spanning trees in a simple graph. Over the past decades, many techniques have been developed for finding upper and lower quantum query bounds for these graph problems. We can generalize these problems to detecting certain properties of higher-rank hypergraphs and ask whether these techniques still apply. In this thesis, we will see that as the rank increases, some complexity bounds still hold, although less effectively, while for other problems the nontrivial complexity bounds vanish. Moreover, we focus on using the generalized adversary and learning graph techniques to find nontrivial quantum query bounds for different hypergraph search problems. The following results are presented. • A general quantum query lower bound for subhypergraph-closed properties and monotone properties over r-partite r-uniform hypergraphs. • Tight quantum query bounds for the connectivity and acyclicity problems over r-uniform hypergraphs. • A nontrivial learning graph algorithm for the 3-simplex finding problem. • A formulation of the nested quantum walk in the adaptive learning context, used to give a nontrivial quantum query algorithm for the 4-simplex finding problem. • A natural relationship between lower bounds for simplex finding at different ranks. • A nontrivial quantum query lower bound for the 3-simplex sum problem, via the learning graph formalization of the tetrahedron certificate structure.
  • Item
    Triangle count estimation and label prediction over uncertain streaming graphs
    (University of Waterloo, 2024-07-09) Mohanty, Ipsita
    This thesis aims to integrate the notions of uncertainty with graph stream processing, presenting probabilistic models to enhance real-time analytical capabilities in graph database systems. These systems are crucial for managing interconnected data in various domains, such as social networks, traffic networks, and genomic databases, where data often contains incomplete or probabilistic connections that complicate processing and analysis. We develop and validate two main methodologies: a martingale-based approach for approximating triangle counts in edge-uncertain streaming graphs and a Graph Neural Network (GNN)-based method for dynamic label prediction in attribute-uncertain streaming graphs. Both methods demonstrate robust performance in handling dynamic and uncertain data, thus opening new avenues for future research in expanding the scope of graph-based analytics. This work lays the groundwork for future developments in uncertain graph processing, suggesting pathways to refine these approaches and explore new applications in dynamic environments.
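    The quantity being approximated here has a simple closed form on small static graphs: by linearity of expectation, the expected triangle count of an edge-uncertain graph is the sum, over vertex triples, of the products of the three edge probabilities. The sketch below computes that expectation exactly; it is a non-streaming stand-in for the martingale-based estimator, which approximates the same quantity one edge at a time.

        from itertools import combinations

        # Edge-uncertain graph: each edge exists independently with probability p.
        edges = {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.5,
                 ("c", "d"): 0.7, ("b", "d"): 0.6}
        prob = {frozenset(e): p for e, p in edges.items()}
        nodes = {v for e in edges for v in e}

        expected = sum(
            prob.get(frozenset((u, v)), 0.0)
            * prob.get(frozenset((v, w)), 0.0)
            * prob.get(frozenset((u, w)), 0.0)
            for u, v, w in combinations(sorted(nodes), 3)
        )
        print(expected)  # 0.9*0.8*0.5 + 0.8*0.7*0.6 = 0.696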
  • Item
    Fuzzing OpenMP Compilers
    (University of Waterloo, 2024-07-08) Chang, Raymond
    OpenMP is a widely used API for parallel programming in C/C++ and Fortran. Its flexibility and simplicity have made it popular in many numerical and scientific applications. The prevalence of OpenMP programs in such important areas makes the correctness of OpenMP compilers significant. Unfortunately, OpenMP compilers are not tested as thoroughly as regular C/C++ compilers. More importantly, it is difficult to apply previous mutation-based testing techniques like EMI because of the parallelism in seed programs. This thesis introduces new fuzz-testing approaches specifically for OpenMP compilers. For existing OpenMP programs, we de-parallelize and mutate them with dead-code injection and false parallelization. We also transform existing regular C programs into OpenMP programs with template-based mutations. Two test suites were used for the evaluation: the OpenMP Offloading Validation & Verification Suite (SOLLVE VV) and programs generated by Csmith. On SOLLVE VV, the proposed techniques increase coverage by at least 4.60% for GCC and 1.81% for LLVM; on Csmith-generated programs, coverage improves by at least 3.90% for GCC and 1.85% for LLVM.
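    The dead-code-injection style of mutation mentioned above is easy to sketch: wrap extra OpenMP directives in a branch that can never execute, so a correct compiler must produce identical output for the seed and the mutant. The generator below is a hypothetical illustration, not the thesis's tooling.

        import textwrap

        SEED = textwrap.dedent("""\
            #include <stdio.h>
            int main(void) {
                long sum = 0;
                for (int i = 0; i < 1000; i++) sum += i;
                printf("%ld\\n", sum);
                return 0;
            }
        """)

        def inject_dead_openmp(src):
            # Dynamically dead region: exercises OpenMP code paths in the
            # compiler without changing the program's observable output.
            dead = ("    if (0) {\n"
                    "        #pragma omp parallel for\n"
                    "        for (int j = 0; j < 10; j++) { }\n"
                    "    }\n")
            return src.replace("    long sum = 0;\n", "    long sum = 0;\n" + dead)

        # Differential testing: compile both with -fopenmp and diff their outputs.
        print(inject_dead_openmp(SEED))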
  • Item
    Eventually Durable State Machines
    (University of Waterloo, 2024-07-05) Kathuria, Kriti
    Typically, applications are designed to guarantee durability of the data they store. Durability is achieved by replicating client write requests to multiple machines, and this replication adds to the time it takes for the application to respond to client requests. Latency-sensitive applications may therefore implement ad-hoc mechanisms to circumvent durability costs, such as responding without replicating writes. Such ad-hoc mechanisms are hard to reason about and may leave the application in an inconsistent state. We propose the Eventually Durable (ED) State Machine, a principled approach that lets applications respond without waiting for replication to complete. The ED model offers fast response times but leaves applications vulnerable to data loss when failures occur; it therefore provides a strong ordering guarantee and clear failure semantics with which to reason about lost writes. Further, we develop the ED Raft protocol, a derivative of the Raft consensus protocol, to implement the Eventually Durable State Machine. We describe ED Raft and its key properties, and show that it supports the ED model.
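    A minimal sketch of the respond-before-replication idea follows; it is not the ED Raft protocol itself, just the latency/durability trade at its core: the leader applies and acknowledges a write immediately while a background thread replicates the log, so entries past the last replicated index are exactly the writes at risk if a failure occurs.

        import queue, threading, time

        log, durable_index = [], -1
        replication_q = queue.Queue()

        def replicate_forever():
            # Stand-in for follower replication; in ED Raft this is the Raft log flow.
            global durable_index
            while True:
                idx = replication_q.get()
                time.sleep(0.01)        # simulated network + quorum latency
                durable_index = idx     # entry is now durable

        threading.Thread(target=replicate_forever, daemon=True).start()

        def write(cmd):
            log.append(cmd)
            replication_q.put(len(log) - 1)
            return "ok"                 # acknowledged before replication completes

        write("x=1"); write("y=2")
        print("acked:", len(log), "durable up to index:", durable_index)  # may lag
        time.sleep(0.1)
        print("durable up to index:", durable_index)  # eventually catches up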
  • Item
    Unsupervised Losses for Clustering and Segmentation of Images: Theories & Optimization Algorithms
    (University of Waterloo, 2024-07-03) Zhang, Zhongwen
    Unsupervised losses are common for tasks with limited human annotations. In clustering, they are used to group data without any labels. In semi-supervised or weakly-supervised learning, they are applied to the unannotated part of the training data. In self-supervised settings, they are used for representation learning. They appear in diverse forms enforcing different prior knowledge. However, formulating and optimizing such losses poses challenges. First, translating prior knowledge into mathematical formulations can be non-trivial. Second, the properties of standard losses may not be obvious across different tasks. Third, standard optimization algorithms may not work effectively or efficiently, requiring the development of customized algorithms. This thesis addresses several related classification and segmentation problems in computer vision, using unsupervised image- or pixel-level losses under a shortage of labels. First, we focus on entropy-based decisiveness as a standard unsupervised loss for softmax models. While discussing it in the context of clustering, we prove that it leads to margin maximization, typically associated with supervised learning. In the context of weakly-supervised semantic segmentation, we combine decisiveness with a standard pairwise regularizer, the Potts model, and study the conceptual and empirical properties of different relaxations of the latter. For both clustering and segmentation problems, we provide new self-labeling optimization algorithms for the corresponding unsupervised losses. Unlike related prior work, we use soft hidden labels that can represent the estimated class uncertainty. Training network models with such soft pseudo-labels motivates a new form of cross-entropy that maximizes the probability of “collision” between the predicted and estimated classes. The proposed losses and algorithms achieve state-of-the-art results on standard benchmarks. The thesis also introduces new geometrically motivated unsupervised losses for estimating thin structures, e.g., complex vasculature trees at near-capillary resolution in 3D medical data.
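    The two unsupervised losses named here have compact forms. Below is a PyTorch sketch in my own notation: decisiveness as the mean entropy of the softmax predictions, and the "collision" cross-entropy as the negative log-probability that a predicted label and a soft pseudo-label, drawn independently, coincide.

        import torch

        def decisiveness(p, eps=1e-8):
            # Mean entropy of predictions p (N x K); lower means more decisive.
            return -(p * (p + eps).log()).sum(dim=1).mean()

        def collision_cross_entropy(p, q, eps=1e-8):
            # p: predicted class probabilities; q: soft pseudo-labels (both N x K).
            # The probability that independent draws from p and q agree is sum_k p_k q_k.
            return -((p * q).sum(dim=1) + eps).log().mean()

        p = torch.softmax(torch.randn(4, 3), dim=1)
        q = torch.softmax(torch.randn(4, 3), dim=1)
        print(decisiveness(p).item(), collision_cross_entropy(p, q).item())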
  • Item
    Memolet: Reifying the Reuse of User-AI Conversational Memories
    (University of Waterloo, 2024-06-19) Yen, Hen Chen
    As users engage more frequently with AI conversational agents, conversations may exceed the agents' "memory" capacity, leading to failures in leveraging the right memories for better responses. Users therefore have to revisit related memories and re-provide them to the agents to ensure that generation refers to the correct memories. However, finding past memories to reuse is cumbersome: users must retrieve related information across various conversations and articulate to the AI their intentions for reusing these memories. To support users in recalling and reusing relevant memories, we introduce Memolet, an interactive object that reifies memory reuse. Users can directly manipulate Memolet to specify which memories to reuse and how to use them. We developed a system demonstrating Memolet's interactions across various memory-reuse stages, including memory extraction, organization, prompt articulation, and generation refinement. Through a user study, we gained insights into users' experiences with Memolet for memory reuse in AI conversations. The study validates the system's usefulness and provides design implications for future systems that support user-AI conversational memory reuse.
  • Item
    Improving the Precision of Analyses Queries in Factbase Models of Software Systems
    (University of Waterloo, 2024-05-31) Ke, Xiang Yun (Fa Fa)
    Large software systems are developed by multiple teams of software engineers, each working on different components that are supposed to work together. Each component is responsible for a subset of system functionality and the components communicate with each other to react to information received from hardware sensors and user inputs. It is often infeasible to perform manual code reviews on large software systems because the code base may be too large, the components may be written in different languages or language variants, or the concurrency of components can lead to a state explosion of the system's analysis space. To mitigate these challenges, we create a software model consisting of facts about the software system and perform analyses on the model. Analyses performed on these software models are not always sound and complete. One of the reasons is that the order of execution of facts in the model is unknown, leading to many false-positive results that refer to infeasible execution paths. Our work addresses this problem by extending a fact-based software model with control-flow-graph facts and associating existing facts with their corresponding control flow blocks. Then, the analyses are revised to check that results correspond to execution paths (in terms of control-flow-graph facts) before reporting results to the engineers. This extra execution-path check causes the revised analyses to exhibit significant performance overhead. To reduce the overall execution time of the analyses, we (1) stage analysis queries so that they work on a subset of the facts to generate partial results incrementally and (2) employ an on-the-fly execution path check that eliminates invalid sub-results within the analysis engine. Our work is evaluated with ten different analyses performed on six software systems that use the ROS (Robot Operating System) framework for cross-component communication. A detailed precision and performance evaluation was performed on Autonomoose and WISE-ADS, two ROS-based autonomous driving systems. In addition, this thesis adapts our approach to non-ROS systems (in which components communicate via function calls instead of passed messages) and we evaluate that work by analyzing a non-ROS software controller. The controller experiment is designed to test the scalability of our work when applied to large real-world applications.
  • Item
    Reliable WiFi Backscatter Communication in WiTAG
    (University of Waterloo, 2024-05-31) Adhikari, Manoj
    WiFi backscatter systems offer the potential to provide low-powered WiFi-compatible communication. This technology is especially promising when coupled with low-power sensors to periodically communicate readings from IoT devices. WiTAG is an extremely attractive approach because it greatly reduces power consumption by avoiding the use of WiFi receivers or signal detectors while ensuring compatibility with existing WiFi infrastructure. WiTAG operates at the MAC layer by corrupting or not corrupting subframes (MPDUs) within a transmitted frame (A-MPDU): corrupting an MPDU signals a 0 and leaving it intact signals a 1. Because it eschews receivers and signal detectors, WiTAG is unable to sense when frames are being sent by the nearby WiFi devices it relies on for communication. In this thesis, we describe the significant challenges that arise when formulating, transmitting, and reliably detecting and decoding messages sent from WiTAG, and we design a message encoding framework to overcome them. We show that although WiTAG relies on the probabilistic overlap of a tag's message with an A-MPDU, it is possible to increase the odds of an overlap and thus increase message rates. This permits the transmission of highly reliable messages in a relatively short period of time.
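    The MAC-layer encoding is easy to state in code. The toy sketch below is mine and ignores framing details and the probabilistic overlap the thesis analyzes: each MPDU slot in an A-MPDU carries one bit, with a corrupted subframe decoded as 0 and an intact one as 1.

        def encode(bits):
            # Tag side: choose, per MPDU slot, whether to corrupt that subframe.
            return ["corrupt" if b == 0 else "leave" for b in bits]

        def decode(mpdu_checksums_ok):
            # Receiver side: per-MPDU checksum pass/fail recovers the bit pattern.
            return [1 if ok else 0 for ok in mpdu_checksums_ok]

        message = [1, 0, 1, 1, 0, 0, 1, 0]
        actions = encode(message)
        received = [a == "leave" for a in actions]   # failed checksum -> False
        assert decode(received) == message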
  • Item
    Meta-Solving via Machine Learning for Automated Reasoning
    (University of Waterloo, 2024-05-30) Scott, Joseph
    Automated reasoning (AR) and machine learning (ML) are two of the foundational pillars of artificial intelligence (AI), yet they have developed largely independently. Integrating the two sub-fields holds tremendous potential for problems that are otherwise difficult to solve, especially in the context of logic solvers, blackbox deductive reasoning engines designed to tackle NP-hard problems. The early 2000s witnessed a 'silent revolution' leading to the emergence of highly efficient boolean satisfiability (SAT), satisfiability modulo theories (SMT), and mixed-integer linear programming (MILP) solvers, capable of scaling to hundreds of millions of variables and deployed billions of times daily across industry. These advancements were primarily due to novel symbolic reasoning techniques as well as the use of ML in solvers. Building on these successes, this thesis presents several advances in the use of ML in solvers. One way to characterize the value of ML in the context of automated reasoning tools is the following: under widely believed complexity-theoretic assumptions, we do not expect any one solver, or even a fixed sequence of solvers, to perform well on all classes of instances, and there is considerable empirical support for this observation. Hence, it is reasonable to research methods that enable solver users to adaptively select a (sequence of) solver(s) for any given instance, and ML provides a promising means to realize such (adaptive) algorithm selection methods. We make the following contributions in this thesis. First, inspired by the success of the algorithm selection tool SATZilla for SAT solvers, we present the design and implementation of MachSMT, an algorithm selection tool for SMT solvers. MachSMT supports the entirety of the SMT-LIB and leverages ML over state-of-the-art SMT solvers. We provide empirical evidence for the value of algorithm selection and the efficacy of MachSMT over three broad SMT usage scenarios: solver selection for instances obtained from SMT-COMP (an annual competition for SMT solvers), configuration selection for a given solver (cvc5) over a large industrial benchmark suite, and solver selection for a specific domain (network verification). Second, we present the design and implementation of a novel adaptive algorithm selection tool (aka a meta-solver), called Goose, for neural network verification solvers, a class of tools aimed at improving the trustworthiness of ML systems. Traditional algorithm selection tools (e.g., MachSMT) tend to be non-adaptive: once a solver is selected for a given instance, this selection is not changed at runtime. By contrast, a key novelty of Goose is that it implements an adaptive sequential portfolio, i.e., it calls a set of subsolvers in a sequence, wherein the order in which subsolvers are called is determined adaptively based on their online and offline performance histories. We have implemented a variety of complete and incomplete subsolvers in Goose (in addition to using a set of off-the-shelf ones), along with the following synergizing techniques to implement its adaptive sequential portfolio: algorithm selection, probabilistic satisfiability inference, and time-iterative deepening. Additionally, in the spirit of improving solver performance via ML techniques, we present BanditFuzz, a reinforcement learning (RL) algorithm for relative performance fuzzing of solvers. While MachSMT and Goose leverage supervised learning to make solvers faster, BanditFuzz leverages RL to search for performance issues in solvers: it searches for short problem instances on which a set of target solvers underperforms while a set of reference solvers performs well. Such instances expose performance issues in solvers and are often caused by solver developer errors (e.g., missing rewrite rules or errors in heuristics). We additionally introduce Pierce, a versatile and extensible testing tool aimed at solvers for the neural network verification (NNV) problem. At its core, Pierce implements a fuzzing engine over the Open Neural Network Exchange (ONNX), a standardized model format for deep learning and classical ML, and VNN-LIB, a specification standard over the input-output behavior of ML systems. Pierce supports the entirety of the VNN-LIB standard and most of ONNX v18.
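    Algorithm selection in the MachSMT style can be prototyped generically. The sketch below is a stand-in with invented features and synthetic runtimes, not MachSMT's actual feature set or learner: train one empirical-runtime model per solver and route each instance to the predicted-fastest solver.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        rng = np.random.default_rng(0)
        X = rng.random((200, 8))                  # placeholder instance features
        runtimes = {"solverA": 10 * X[:, 0] + rng.random(200),
                    "solverB": 10 * X[:, 1] + rng.random(200)}

        models = {name: RandomForestRegressor(random_state=0).fit(X, t)
                  for name, t in runtimes.items()}

        def select(features):
            # Route to the solver whose model predicts the lowest runtime.
            preds = {n: m.predict(features.reshape(1, -1))[0]
                     for n, m in models.items()}
            return min(preds, key=preds.get)

        print(select(rng.random(8)))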
  • Item
    Symbolic Regression and Sequence Modelling with Conditional and Dynamic Language Models
    (University of Waterloo, 2024-05-30) Valipour, Mojtaba
    In an era where the boundaries of machine learning are continuously being pushed, this thesis presents two advances in deep learning and artificial intelligence, with a focus on symbolic regression and dynamic training methodologies for neural networks. The first major contribution, SymbolicGPT, introduces a novel approach to symbolic regression using a transformer-based language model. This model significantly outperforms traditional methods by leveraging the strengths of probabilistic language models for improved accuracy and efficiency. The second theme of this thesis revolves around dynamic training methodologies, aimed at enhancing the adaptability and computational efficiency of neural networks under varying constraints. Within this framework, we introduce DyLoRA and SortedNet as key innovations. DyLoRA offers a dynamic, search-free low-rank adaptation technique, enabling models to adjust their complexity on the fly without extensive retraining. SortedNet proposes a generalized framework for embedding multiple neural network architectures within a single model, facilitating efficient model selection and adaptation. Extending SortedNet, SortedLLama applies these principles to large language models, demonstrating efficient dynamic inference capabilities.
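    The core DyLoRA trick, as described above, is to make one low-rank adapter serve many ranks. A minimal PyTorch sketch (mine, omitting the paper's training details such as scaling and rank scheduling): sample a rank each training step and use only the first r rows/columns of the adapter matrices, so any truncated rank works at inference without retraining.

        import torch, torch.nn as nn

        class DyLoRALinear(nn.Module):
            def __init__(self, d_in, d_out, r_max=8):
                super().__init__()
                self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02,
                                           requires_grad=False)  # frozen base weight
                self.A = nn.Parameter(torch.randn(r_max, d_in) * 0.01)
                self.B = nn.Parameter(torch.zeros(d_out, r_max))
                self.r_max = r_max

            def forward(self, x, r=None):
                # Truncate both adapters to rank r; any r <= r_max is valid.
                r = r or torch.randint(1, self.r_max + 1, ()).item()
                delta = self.B[:, :r] @ self.A[:r, :]
                return x @ (self.weight + delta).T

        layer = DyLoRALinear(16, 16)
        y = layer(torch.randn(2, 16), r=4)  # deploy-time rank chosen freely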
  • Item
    Parallel Transaction Execution in Public Blockchain Systems
    (University of Waterloo, 2024-05-27) Shahid, Rizwan
    Public blockchain systems like Ethereum and Bitcoin suffer from poor transaction throughput, leading to delayed transaction execution and high transaction fees. They execute transactions one by one, failing to exploit the parallelism inherent in the workload. We present Block-X, a parallel transaction processing system for public blockchains with serializable concurrency control: transactions in a block execute in a serializable order equivalent to their order in the block. Block-X pre-executes transactions that are waiting to be added to a block and, through this pre-execution, estimates the keys a transaction will read or write. It uses this information to create a parallel execution schedule and runs transactions optimistically in parallel following that schedule. It also uses the pre-execution to prefetch data that will be accessed during critical-path transaction execution. If a smart contract transaction accesses data outside its initially estimated read-write set of keys, Block-X detects and resolves any potential conflicts, so the final state is equivalent to the state produced by sequential execution of the transactions in block order. Finally, Block-X accelerates block validation by providing the parallel execution schedule produced in the block execution step so that transactions can be validated in parallel. We implemented our system on Ethereum, so it is compatible with EVM chains. Our evaluation demonstrates that Block-X achieves up to 2.3× higher throughput than Ethereum, and its performance is comparable to systems that perform pessimistic execution, which require predefined read-write sets and reject transactions that access data outside of them.
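    The scheduling step can be illustrated with estimated read/write sets. A simplified sketch (mine; Block-X's schedule construction and conflict repair are more involved): place each transaction, in block order, into the earliest batch after every conflicting predecessor, so each batch can run in parallel while batch order preserves block order.

        def conflicts(t1, t2):
            return bool(t1["w"] & (t2["r"] | t2["w"]) or t2["w"] & t1["r"])

        def schedule(txs):
            batches, placed = [], []          # placed: (tx, batch_index)
            for tx in txs:                    # txs arrive in block order
                earliest = max((bi + 1 for other, bi in placed
                                if conflicts(other, tx)), default=0)
                if earliest == len(batches):
                    batches.append([])
                batches[earliest].append(tx)
                placed.append((tx, earliest))
            return batches

        txs = [{"id": 1, "r": {"a"}, "w": {"b"}},
               {"id": 2, "r": {"c"}, "w": {"d"}},    # independent of tx 1
               {"id": 3, "r": {"b"}, "w": {"e"}}]    # reads tx 1's write
        for i, batch in enumerate(schedule(txs)):
            print("batch", i, [t["id"] for t in batch])  # [1, 2] then [3]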
  • Item
    Writing My Own Line Drawing Software as an Artist
    (University of Waterloo, 2024-05-24) Philbrick, Greg
    I describe trying to improve my own art—line art, specifically—by developing computer science-based tools. The results of this experience are three technical contributions targeted at an NPR (Non-Photorealistic Rendering) research audience. The first is a formal definition of hatching, a traditional drawing technique. The second is the "hatching shape," a software primitive for rapidly performing hatching. The third is the invented problem of "mashing up" two drawings, along with an algorithm for solving this problem. In addition to these technical contributions, I relate how researching NPR changed me as an artist.
  • Item
    Deep Unsupervised Learning for Biodiversity Analyses: Representation learning and clustering of bacterial, mitochondrial, and barcode DNA sequences
    (University of Waterloo, 2024-05-22) Millan Arias, Pablo
    Amid the recent surge in next-generation sequencing technologies, alignment-free algorithms stand out as a promising alternative to traditional alignment-based methods in phylogenetic analyses. Specifically, the use of genomic signatures has enabled the success of supervised machine learning-based alignment-free methods in taxonomic classification. Motivated by this success, this dissertation investigates the potential of unsupervised learning-based alignment-free algorithms in genomic signature categorization. We conclude that meaningful information can be learned without reliance on labels, suggesting that supervision can be effectively eliminated from the learning process. First, we developed DeLUCS, a Deep Learning-based Unsupervised Clustering method for DNA Sequences. It trains a discriminative neural network to identify meaningful taxonomic clusters without supervision; we designed and conducted several proof-of-concept experiments to validate the effectiveness of this methodology on various datasets. Building on the contrastive nature of DeLUCS, we enhance it through self-supervised representation learning, introducing iDeLUCS and demonstrating its applicability to non-parametric clustering of DNA sequences, where it matches the performance of alignment-based and alignment-assisted clustering algorithms. In addition, we successfully apply unsupervised learning to categorize the genomic signatures of microbial extremophiles, providing quantitative evidence that microbial extremophile genomes may contain information beyond ancestry or taxonomy. The evidence provided by our computational experiments led to the biological insight that a pervasive environmental component exists in the genomic signature of extremophilic organisms and could potentially redefine the concept of genomic signature. Finally, we introduce BarcodeBERT, a transformer-based encoder optimized for DNA barcodes. Since barcodes are short DNA fragments that contain enough information for the taxonomic identification of an organism, our model learns this taxonomic information and generates expressive embeddings that enable efficient classification of barcodes from novel specimens. We evaluate the quality of these embeddings through several downstream tasks, such as supervised fine-tuning and linear probing for species classification of known species, and nearest-neighbour probing for genus classification of unknown species. Additionally, the learned embeddings proved effective in a zero-shot classification framework for images of insects, underscoring the model's utility in integrating genomic and visual data for species identification. Our work attempts to connect the worlds of biodiversity and taxonomic identification with the world of deep unsupervised learning. Our findings reveal deep learning's untapped potential to capture taxonomic information, even without supervision. The methodologies presented in this dissertation can also be used to learn expressive DNA embeddings and test evolutionary hypotheses.
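    Genomic signatures are alignment-free numeric representations of sequences. As a much-simplified stand-in for the learned representations in DeLUCS and iDeLUCS, the sketch below builds normalized k-mer frequency vectors and clusters them with k-means; the sequences are placeholders.

        from itertools import product
        import numpy as np
        from sklearn.cluster import KMeans

        K = 3
        KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
        INDEX = {kmer: i for i, kmer in enumerate(KMERS)}

        def signature(seq):
            # Normalized k-mer frequency vector: a simple alignment-free signature.
            v = np.zeros(len(KMERS))
            for i in range(len(seq) - K + 1):
                v[INDEX[seq[i:i + K]]] += 1
            return v / max(v.sum(), 1)

        seqs = ["ACGTACGTACGTGGGG", "ACGTACGTACGTCCCC",
                "TTTTGGGGTTTTGGGG", "TTTTGGGCTTTTGGGC"]  # placeholder sequences
        X = np.stack([signature(s) for s in seqs])
        print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))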
  • Item
    Multivariate Triangular Quantile Maps for Novelty Detection
    (University of Waterloo, 2024-05-21) Wang, Jingjing
    Novelty detection, a fundamental task in machine learning, has drawn much recent attention due to its wide-ranging applications and the rise of neural approaches. In this thesis, we present a general framework for neural novelty detection that centers around a multivariate extension of the univariate quantile function. Our framework unifies and extends many classical and recent novelty detection algorithms, and opens the way to exploiting recent advances in flow-based neural density estimation. We adapt the multiple gradient descent algorithm to obtain the first efficient end-to-end implementation of our framework that is free of hyperparameter tuning. Extensive experiments over a number of synthetic and real datasets confirm the efficacy of our proposed method against state-of-the-art alternatives.
  • Item
    Navigating Identities in Text: Towards an Approach for Dementia Care
    (University of Waterloo, 2024-05-21) Gano, Jess
    Identity, as a concept, is concerned with the social positioning of the self and the other. It manifests through discourse and interactions, and is expressed in relation to other perceived identities. For example, can one be or talk as a leader without implicitly categorizing those one interacts with as subordinates or employees? Research shows that the onset and progression of dementia may undermine an individual's sense of self and identity. This loss of self or identity has been found not only to cause a significant decrease in well-being, but also to affect caregiver/care-recipient relationships. However, while identity may be compromised in some way, it is not necessarily completely lost. Autobiographical stories, especially those told repeatedly, may serve as a means to reveal significant aspects of the storyteller's self and identity. In this thesis, we explore the task of persona attribute extraction from dialogues as a proxy for identity cues. We define a persona attribute as a triplet (s, r, o), where the relation r indicates the persona attribute type or the relationship between the subject s and object o, e.g., (I, has_hobby, knitting). Employing an information extraction approach, we design a two-stage persona attribute extractor consisting of a relation predictor and an entity extractor. Respectively, we define relation prediction as a multi-label classification task using BERT embeddings and feedforward neural networks, and entity extraction as a template-infilling task following the pre-training objective of T5 (Raffel, 2020). We apply our methods to a proxy dataset created by combining Persona-Chat and Dialogue-NLI. Given ethical considerations and potential risks, directly evaluating our methods on a dementia use case is not feasible; we therefore use a dataset of interviews with older adults to assess feasibility in a context more closely resembling the dementia use case. Exploring the research problem and developing our methodology highlights the following insights: (1) inferring identities from text, especially considering their nuanced representation in discourse, is challenging due to the abstract nature of identity itself, and (2) to our knowledge, there is no available dataset that exhibits the distinct speech characteristics of older adults, making it very challenging to train and evaluate models tailored to this demographic. Furthermore, experiments on the older-adults dataset show that a transfer learning approach to this problem is insufficient due to the significant contrast between the source- and target-domain datasets.
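    The triplet representation and the two-stage split are easy to make concrete. The sketch below shows the data shapes only; both stage functions are hypothetical stand-ins for the thesis's BERT-based relation predictor and T5-based template infiller.

        from dataclasses import dataclass

        @dataclass
        class PersonaTriple:
            s: str  # subject, e.g. the speaker
            r: str  # relation / persona attribute type
            o: str  # object, e.g. the attribute value

        def predict_relations(utterance):
            # Stage 1 stand-in for the multi-label classifier (BERT + FFN).
            return ["has_hobby"] if "knitting" in utterance else []

        def fill_template(utterance, relation):
            # Stage 2 stand-in for T5 infilling of "(I, has_hobby, <extra_id_0>)".
            return PersonaTriple(s="I", r=relation, o="knitting")

        utt = "I spend my evenings knitting scarves for my grandkids."
        triples = [fill_template(utt, r) for r in predict_relations(utt)]
        print(triples)  # [PersonaTriple(s='I', r='has_hobby', o='knitting')]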
  • Item
    Variability in Factors Influencing Pull Request Merge Decisions: A Microscopic Exploration
    (University of Waterloo, 2024-05-16) Ahmed, Nasif
    Context: The pull-based development model is a widely adopted practice in distributed version control systems, particularly in open-source projects. In this model, contributors submit pull requests proposing changes to the codebase, which are then reviewed and potentially merged by project maintainers. Previous studies have extensively investigated the influence of different factors on merge outcomes, aiming to generalize their impact across multiple projects. Objective: This thesis takes a unique approach by examining these factors at the project level, aiming to understand how the influence of each factor varies across projects. Methodology: To achieve this, we conducted a large-scale quantitative analysis of 841,399 pull requests from 1,100 GitHub projects. We constructed fixed-effect logistic regression models for each project and explored the correlations between different factors and merge outcomes. Results: Our analysis indicates that the influence of factors varies across projects, both in their order and in their direction. For example, while contributor experience is highly valued in many projects, it was found to be statistically insignificant in others. Likewise, the likelihood of a successful merge increases with the number of commits in some projects, whereas in others it has the opposite effect. These findings have implications for both researchers and practitioners.
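    The per-project modelling step is straightforward to reproduce in outline. The sketch below uses synthetic data with invented feature names and effect sizes: fit one logistic regression per project and compare the signed coefficients, which is where the cross-project variation shows up.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(1)
        FEATURES = ["contributor_experience", "num_commits", "discussion_length"]

        def synth_project(sign):
            # Invented data: 'num_commits' helps in one project, hurts in the other.
            X = rng.normal(size=(500, 3))
            logits = 1.2 * X[:, 0] + sign * 0.8 * X[:, 1] - 0.3 * X[:, 2]
            y = (logits + rng.logistic(size=500) > 0).astype(int)
            return X, y

        for name, sign in [("projectA", +1), ("projectB", -1)]:
            X, y = synth_project(sign)
            coef = LogisticRegression().fit(X, y).coef_[0]
            print(name, dict(zip(FEATURES, coef.round(2))))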
  • Item
    Explore the In-context Learning Capability of Large Language Models
    (University of Waterloo, 2024-05-10) Li, Tianle
    The rapid evolution of Large Language Models (LLMs) has marked the beginning of a new age in AI capabilities, particularly in natural language understanding and processing. At the forefront of these advancements is the exploration of in-context learning, a paradigm that enables models to adapt to new tasks without explicit retraining. This thesis presents a comprehensive investigation into the in-context learning capabilities of LLMs, guided by two pivotal studies: KB-BINDER's deployment in Question Answering over Knowledge Bases (KBQA) and the evaluation of LLMs' performance on LongICLBench, a self-curated benchmark for long-context understanding. The first facet of this investigation, embodied by KB-BINDER, addresses the challenge of generalizing LLMs to diverse KBQA tasks without task-specific training. KB-BINDER pioneers a novel few-shot in-context learning approach, utilizing Codex to generate logical-form drafts and employing BM25 for draft binding, and demonstrates remarkable efficacy across heterogeneous KBQA datasets. We believe KB-BINDER can serve as an important baseline for future research on using the few-shot capability of LLMs to solve KBQA. Complementing this, the second study introduces LongICLBench, a specialized benchmark designed to test long-context LLMs on long, context-rich sequences across extreme-label classification tasks with in-context learning. Through evaluation on tasks of increasing difficulty, a clear performance threshold is identified, highlighting the current limitations of LLMs in handling extensive context windows and revealing a bias towards labels positioned near the end of the input when instances with the same label are grouped in the demonstration. This underscores a crucial gap in current long-context LLMs' ability to reason over long sequences, paving the way for further enhancements in long-context comprehension. Together, these studies form the cornerstone of this thesis, encapsulating the dynamic landscape of in-context learning within LLMs. Through a detailed examination of KB-BINDER and LongICLBench, this work not only charts the current capabilities and boundaries of LLMs but also lays the groundwork for future advancements in making LLMs more adaptable and proficient in handling a wide array of complex tasks.
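    The BM25 draft-binding step is a standard retrieval call. A sketch with the rank_bm25 package follows; the candidate relations and the draft are placeholders, and KB-BINDER's actual binding operates over KB entity and relation surface forms for complete generated logical-form drafts.

        from rank_bm25 import BM25Okapi

        # Placeholder KB relation vocabulary to bind a hallucinated draft against.
        candidates = ["people.person.place_of_birth",
                      "people.person.nationality",
                      "location.location.containedby"]
        tokenized = [c.replace(".", " ").replace("_", " ").split()
                     for c in candidates]
        bm25 = BM25Okapi(tokenized)

        draft = "person born in place"   # draft relation produced by the LLM
        print(bm25.get_top_n(draft.split(), candidates, n=1))
        # ['people.person.place_of_birth'] -- the closest real KB relation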