Mathematics (Faculty of)

Permanent URI for this community: https://uwspace.uwaterloo.ca/handle/10012/9924

Welcome to the Faculty of Mathematics community.

This community and its collections are organized using the University of Waterloo's Faculties and Academics structure. In this structure:

Communities are Faculties or Affiliated Institutions
Collections are Departments or Research Centres

Research outputs are organized by type (e.g., Master Thesis, Article, Conference Paper).

New collections following this structure will be created upon request.

Browse

A Study of the Opportunities and Challenges of Using Edge Computing to Accelerate Cloud Applications
(University of Waterloo, 2025-02-18) Qadi, Hala; Al-Kiswany, Samer
I explore the viability of using edge clusters to host latency-sensitive applications and to run services that can improve end-to-end communication performance across both wide area networks (WANs) and 5G environments. The study examines the viability of using edge clusters in three scenarios: accelerating TCP communications through TCP splitting in 5G deployments, hosting an entire application-level service or the latency-sensitive part of an application on an edge cluster, and deploying a TCP splitting service on edge clusters to support WAN communication. I explore these scenarios while varying packet drop rates, communication stacks, congestion control protocols, and TCP buffer sizes. My findings bring new insights about these deployment scenarios. I show that edge computing, especially through TCP splitting, can significantly improve end-to-end communication performance over the classical communication stack. TCP splitting over the 5G communication stack does not bring any benefit and can reduce throughput. This is because of the unique characteristics of the 5G communication stack. Furthermore, over the classical communication stack, TCP splitting brings higher benefit for flows larger than 64 KB. These findings provide valuable insights into how edge clusters can accelerate TCP communication in different network environments and identify high-impact research ideas for future work.
An Analysis of Network-Partitioning Failures in Cloud Systems
(University of Waterloo, 2018-12-10) Alquraan, Ahmed; Al-Kiswany, Samer
We present a comprehensive study of 136 system failures attributed to network-partitioning faults from 25 widely used distributed systems. We found that the majority of the failures led to catastrophic effects, such as data loss, reappearance of deleted data, broken locks, and system crashes. The majority of the failures can easily manifest once a network partition occurs: They require little to no client input, can be triggered by isolating a single node, and are deterministic. However, the number of test cases that one must consider is extremely large. Fortunately, we identify ordering, timing, and network fault characteristics that significantly simplify testing. Furthermore, we found that a significant number of the failures are due to design flaws in core system mechanisms. We found that the majority of the failures could have been avoided by design reviews, and could have been discovered by testing with network-partitioning fault injection. We built NEAT, a testing framework that simplifies the coordination of multiple clients and can inject different types of network-partitioning faults. We used NEAT to test seven popular systems and found and reported 32 failures.
An Analysis of Partial Network Partitioning Failures in Modern Distributed Systems
(University of Waterloo, 2020-01-03) Alfatafta, Mohammed; Al-Kiswany, Samer
We present a comprehensive study of system failures from 12 popular systems caused by a peculiar type of network partitioning fault: partial partitions. Partial partitions isolate a set of nodes from some, but not all, nodes in the cluster. Our study reveals that the studied failures are catastrophic; they lead to data loss, complete system unavailability, or stale and dirty reads. Furthermore, our study reveals that these failures manifest easily: they are deterministic, can be triggered by isolating a single node, and require no interaction with the system’s clients. We dissected the implemented fault tolerance techniques in eight popular systems. We identified four principled approaches for building a fault tolerance mechanism for partial partitions and identified the shortcomings of the current approaches. The currently implemented fault tolerance techniques are either specific to a particular protocol or implementation or may lead to a complete cluster shutdown despite the availability of alternative network paths between the nodes. Finally, we present NIFTY, a generic communication layer that leverages the capabilities of modern software-defined networking to monitor and recover the connectivity of the cluster in case of partial network partitions. NIFTY is transparent to the application running on top of it. We built NiftyDB, a database system atop NIFTY. NiftyDB implements a set of optimizations that reduce the network overhead and operation latency in case of partial network partitioning. Our analysis and evaluation show that the proposed approach can effectively mask partial network partitioning faults without incurring additional overheads.
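The distinction between a partial and a complete partition described above can be illustrated with a minimal sketch. It is not taken from the thesis: the function name `classify_partition`, the node names, and the representation of faults as a list of broken node pairs are all assumptions made for illustration, and only direct links are considered.

```python
from itertools import combinations


def classify_partition(nodes, broken_links):
    """Classify direct connectivity as 'none', 'complete', or 'partial'.

    Hypothetical helper: `broken_links` lists node pairs that cannot
    communicate directly; every other pair is assumed to be connected.
    """
    broken = {frozenset(link) for link in broken_links}

    def connected(a, b):
        return frozenset((a, b)) not in broken

    if not broken:
        return "none"
    for a, b in combinations(nodes, 2):
        if connected(a, b):
            continue
        # a and b cannot talk directly; if some third node still reaches
        # both of them, the split is partial rather than complete.
        if any(connected(a, c) and connected(b, c)
               for c in nodes if c not in (a, b)):
            return "partial"
    return "complete"
```

For example, with nodes `n1`, `n2`, and `n3`, breaking only the `n1`-`n2` link is classified as partial because `n3` still reaches both sides, whereas breaking both `n1`-`n2` and `n1`-`n3` isolates `n1` completely.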
An Empirical Evaluation of the Viability of the Serverless Paradigm for Scientific Workflows
(University of Waterloo, 2023-12-22) Elshamy, Abdallah; Al-Kiswany, Samer
Scientific workflows are typically data-intensive. They consist of many stages, each of which may contain hundreds to even thousands of tasks. Traditionally, scientific workflows have been executed using the serverful computing model. Serverless computing presents an attractive alternative to the serverful computing model as it frees developers from managing and provisioning resources and offers a fine-grained billing model. In this work, we study the viability and evaluate the trade-offs of using the serverless paradigm to run scientific workflows. We follow an empirical approach to evaluate the performance and cost benefits of this paradigm and to study the suitability of the current serverless software stack to support complex data-intensive scientific workflows. Specifically, we discuss, implement, and evaluate three orchestration approaches for executing scientific workflows: serverful-centralized, serverless-centralized, and serverless-decentralized. This work is the first to implement and evaluate a purely serverless orchestration approach that does not require deploying a dedicated workflow manager for scientific workflows. Our evaluation shows that serverless orchestration approaches cause a noticeable performance overhead for some workflow patterns (e.g., reduce stages) due to accessing a large amount of remote data. We propose two optimizations (i.e., prefetching file privileges and container placement) that exploit data locality to mitigate that impact. Our evaluation with three scientific workflows—Montage, 1000Genomes, and SRA Search—shows that serverless-centralized and serverless-decentralized achieve a comparable performance to a serverful approach. Also, our results show that prefetching file privileges and container placement optimizations improve the performance by 32.6% and 44% respectively when compared to an unoptimized version for the Montage application. 
We also introduce a cost model to estimate which approach is cheaper for a given application and a cloud provider. Our cost analysis shows that answering this question depends on the characteristics of the workflow and the pricing of the cloud provider.
A Fault Injection Tool for Testing Distributed System with Network Faults
(University of Waterloo, 2024-01-30) Khaleel, Seba; Al-Kiswany, Samer
Modern systems are complex; they include hundreds of components that implement complex protocols such as scheduling, replication, membership, resource management, client access, and security. These systems are expected to offer high availability and to preserve data stored in them despite environment faults. Testing is the primary approach for improving system reliability. Testing against environment faults such as hardware failures, memory corruption, and network problems is complicated since environment faults can happen at any time in the system lifetime, at any component, and at any step in a complex protocol. In this work, we focus on testing for network partitioning faults. We build PPATT, a fault injection testing tool that injects network partition faults between components. To reduce the number of test scenarios that need to be considered, we implement two techniques to focus testing on components that communicate during an operation. We verify the tool by reproducing four catastrophic failures from two widely popular systems: Spark and Kafka. To demonstrate the benefits of our system, we test three systems using PPATT: Flink, Hazelcast, and ActiveMQ Artemis. Our testing discovers three failures in these systems. All these failures are due to design flaws.
Improving Cluster Scheduling Resiliency to Network Faults
(University of Waterloo, 2023-05-31) Qunaibi, Sara; Al-Kiswany, Samer
We present a comprehensive empirical study of the impact partial network partitions have on cluster managers in data analysis frameworks. Our study shows that modern scheduling approaches are vulnerable to partial network partitions. Partial partitions can lead to a complete cluster pause or a significant loss of performance. To overcome the shortcomings of state-of-the-art schedulers, we design the topology-aware scheduler (TAS). TAS incorporates the current network connectivity information when making a scheduling decision, to allocate fully connected nodes for a given application. TAS effectively hides partial partitions from applications. Our evaluation of a TAS prototype shows that it can tolerate partial network partitions and eliminate application halting and significant loss of performance.
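The idea of allocating only fully connected nodes can be sketched as follows. This is an illustrative brute-force search under stated assumptions, not the TAS algorithm: the function name `pick_fully_connected`, the broken-link representation, and the exhaustive enumeration (which is exponential in the cluster size) are all hypothetical choices.

```python
from itertools import combinations


def pick_fully_connected(nodes, broken_links, k):
    """Return k nodes that can all reach each other directly, or None.

    Hypothetical helper: tries every k-node group in order and returns
    the first one that contains no broken link.
    """
    broken = {frozenset(link) for link in broken_links}
    for group in combinations(nodes, k):
        if all(frozenset(pair) not in broken
               for pair in combinations(group, 2)):
            return list(group)
    return None
```

With nodes `a`, `b`, `c`, and `d` and a broken `a`-`b` link, a request for three nodes yields `['a', 'c', 'd']`, while a request for four nodes returns `None` because no fully connected group of that size exists.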
In-Network Scheduling for Real-Time Analytics
(University of Waterloo, 2021-04-30) Udayashankar, Sreeharsha; Al-Kiswany, Samer
This thesis presents Bolt, a novel scheduler design for large-scale real-time data analytics. Bolt achieves the scheduling accuracy of modern centralized schedulers while supporting clusters with hundreds of thousands of nodes. At Bolt’s core is a scheduler design that leverages modern programmable switches. Bolt supports a FIFO scheduling policy, as well as task priority-based and task resource constraint-based scheduling policies. Evaluation of a Bolt prototype on our cluster with a Barefoot Tofino switch shows that the proposed approach can reduce scheduling overhead by 40x and increase the scheduling throughput by 50x compared to state-of-the-art centralized and decentralized schedulers.
Leveraging Software-Defined Networking to Mask Partial Network Partitions
(University of Waterloo, 2021-08-11) Alkhatib, Basil; Al-Kiswany, Samer
We present an extensive study focused on partial network partitioning. Partial network partitions disrupt the communication between some but not all nodes in a cluster. First, we conduct a comprehensive study of system failures caused by this fault in 13 popular systems. Our study reveals that the studied failures are catastrophic (e.g., lead to data loss), easily manifest, and are mainly due to design flaws. Our analysis identifies vulnerabilities in core system mechanisms including scheduling, membership management, and ZooKeeper-based configuration management. Second, we dissect the design of nine popular systems and identify four principled approaches for tolerating partial partitions. Unfortunately, our analysis shows that implemented fault tolerance techniques are inadequate for modern systems; they either patch a particular mechanism or lead to a complete cluster shutdown, even when alternative network paths exist. Finally, our findings motivate us to build Nifty, a transparent communication layer that masks partial network partitions. Nifty builds an overlay between nodes to detour packets around partial partitions. Nifty provides an approach for applications to optimize their operation during a partial partition. We demonstrate the benefit of this approach by integrating Nifty with VoltDB and HDFS.
Measuring the Performance of Code Produced with GitHub Copilot
(University of Waterloo, 2022-12-22) Erhabor, Daniel; Al-Kiswany, Samer; Nagappan, Meiyappan
GitHub Copilot is an artificially intelligent programming assistant used by many developers. While a few studies have evaluated the security risks of using Copilot, there has not been any study to show whether it aids developers in producing code with better performance. We evaluate the performance of code produced when developers use GitHub Copilot versus when they do not. To this end, we conducted a user study with 32 participants in which each participant solved two C++ programming problems, one with Copilot and the other without it, and we measured the running time of the participants' solutions on our test data. Our results suggest that using Copilot can produce code with a significantly slower running time.
MECBench: A Framework for Benchmarking Multi-Edge Computing Systems
(University of Waterloo, 2023-01-27) Naman, Omar; Al-Kiswany, Samer
I present MECBench, an extensible benchmarking framework for multi-access edge computing. MECBench is configurable and can emulate networks with different capabilities and conditions, can scale the generated workloads to mimic a large number of clients, and can generate a range of workload patterns. MECBench is extensible; it can be extended to change the generated workload, use new datasets, and integrate new applications. MECBench’s implementation includes machine learning and synthetic edge applications. I demonstrate MECBench’s capabilities through three scenarios: object detection processing for drone navigation, a natural language processing application, and a synthetic workload with configurable compute and I/O intensity. My evaluation shows that MECBench can be used to answer complex what-if questions pertaining to design and deployment decisions of MEC platforms and applications. My evaluation explores the impact of different combinations of applications, hardware, and network conditions as well as the cost-benefit tradeoff of different designs and configurations.
A Network Integrated Design for Micro-scale Cloud Services
(University of Waterloo, 2022-05-25) Abdel-hadi, Ashraf; Al-Kiswany, Samer
We present Falcon, a network-integrated scheduler for micro-scale services. Falcon follows a centralized scheduler design to achieve high scheduling efficiency and leverages modern programmable switches to lower the scheduling latency and increase the scheduling throughput. Falcon supports multiple scheduling policies, such as FIFO and data-locality-aware policies. Our empirical evaluation shows that Falcon reduces scheduling latency by 120 times and increases the scheduling throughput by 100 times compared to state-of-the-art schedulers.
Network-Accelerated Linearizable Reads
(University of Waterloo, 2019-08-07) Takruri, Hatem; Al-Kiswany, Samer
This thesis presents FLAIR, a novel approach for accelerating read operations in leader-based consensus protocols. FLAIR leverages the capabilities of the new generation of programmable switches to serve reads from follower replicas without compromising consistency. The core of the new approach is a packet-processing pipeline that can track client requests and system replies, identify consistent replicas, and, at line speed, forward read requests to replicas that can serve the read without sacrificing linearizability. An additional benefit of FLAIR is that it facilitates devising novel consistency-aware load balancing techniques. Following the new approach, we designed FlairKV, a key-value store atop Raft. FlairKV implements the processing pipeline using the P4 programming language. We evaluate the benefits of the proposed approach and compare it to previous approaches using a cluster with a Barefoot Tofino switch. The evaluation indicates that the proposed approach can bring significant performance gains: up to 43% higher throughput and 35-97% lower latency for most workloads compared to state-of-the-art alternatives.
Optimizing MPI Collective Operations for Cloud Deployments
(University of Waterloo, 2020-01-24) AlSader, Zuhair; Al-Kiswany, Samer; Brecht, Tim
Cloud infrastructures are increasingly being adopted as a platform for high performance computing (HPC) science and engineering applications. For HPC applications, the Message-Passing Interface (MPI) is widely used. Among MPI operations, collective operations are the most I/O intensive and performance critical. However, classical MPI implementations are inefficient on cloud infrastructures because they are implemented at the application layer using network-oblivious communication patterns. These patterns do not differentiate between local and cross-rack communication and hence do not exploit the inherent locality between processes collocated on the same node or the same rack of nodes. Consequently, they can suffer from high network overheads when communicating across racks. In this thesis, we present COOL, a simple and generic approach for MPI collective operations. COOL enables highly efficient designs for collective operations in the cloud. We then present a system design based on COOL that describes how to implement frequently used collective operations. Our design efficiently uses the intra-rack network while significantly reducing cross-rack communication, thus improving application performance and scalability. We use software-defined networking capabilities to build more efficient network paths for I/O intensive collective operations. Our analytic evaluation shows that our design significantly reduces the network overhead across racks. Furthermore, when compared with OpenMPI and MPICH, our design reduces the latency of collective operations by a factor of log N, where N is the total number of processes, decreases the number of exchanged messages by a factor of N, and reduces the network load by up to an order of magnitude. These significant improvements come at the cost of a small increase in the computation load on a few processes.
Parallel Transaction Execution in Public Blockchain Systems
(University of Waterloo, 2024-05-27) Shahid, Rizwan; Wong, Bernard; Al-Kiswany, Samer
Public blockchain systems like Ethereum and Bitcoin suffer from poor transaction throughput, leading to delayed transaction execution and high transaction fees. They execute transactions one by one, failing to exploit the parallelism inherent in the workload. We present Block-X, a parallel transaction processing system for public blockchains with a serializable concurrency control that executes the transactions in a block in a serializable order equivalent to their order in the block. It pre-executes transactions that are waiting to be added to a block. Through this pre-execution, Block-X estimates the keys a transaction wants to read or write. It uses this information to create a parallel execution schedule and run transactions optimistically in parallel following the schedule. It also uses the pre-execution to prefetch data that will be accessed during the critical path transaction execution. If a smart contract transaction accesses data outside of its initially estimated read-write set of keys, Block-X detects and resolves any potential conflicts. The final state is equivalent to the state produced after the sequential execution of transactions in the block order. Finally, Block-X also accelerates the process of validating blocks by providing the parallel execution schedule produced in the block execution step to validate transactions in parallel. We implemented our system on Ethereum so it is compatible with EVM chains. Our evaluation demonstrates that Block-X achieves up to a 2.3× higher throughput than Ethereum. Moreover, our performance is comparable to other systems that perform pessimistic execution. These systems require a predefined read-write set and reject transactions that use data outside of it.
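One way to picture how estimated read and write sets could yield a parallel schedule that stays equivalent to block order is the sketch below. It is an assumption-laden illustration, not the Block-X implementation: the function name `build_schedule`, the "wave" grouping, and the tuple layout of each transaction are hypothetical, and runtime detection of accesses outside the estimated sets is not modelled.

```python
def build_schedule(txs):
    """Group transactions into waves that can run in parallel.

    `txs` is a list of (tx_id, read_keys, write_keys) in block order
    (hypothetical layout). A transaction is placed one wave after the
    latest earlier transaction it conflicts with, so running the waves
    in sequence preserves the block order between conflicting pairs.
    """
    waves = []
    placed = []  # (wave_index, read_set, write_set) per earlier transaction
    for tx_id, reads, writes in txs:
        reads, writes = set(reads), set(writes)
        wave = 0
        for prev_wave, prev_reads, prev_writes in placed:
            # Conflict: write-write, write-after-read, or read-after-write.
            if writes & (prev_reads | prev_writes) or reads & prev_writes:
                wave = max(wave, prev_wave + 1)
        placed.append((wave, reads, writes))
        while len(waves) <= wave:
            waves.append([])
        waves[wave].append(tx_id)
    return waves
```

In this sketch, two transactions that touch disjoint keys share a wave, while a transaction that reads a key written earlier in the block is deferred to a later wave.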
Soteria: An Approach for Detecting Multi-Institution Attacks
(University of Waterloo, 2023-01-02) Zabarah, Saif; Boutaba, Raouf; Al-Kiswany, Samer
We present Soteria, a data processing pipeline for detecting multi-institution attacks. Multi-institution attacks contact a large number of potential targets looking for vulnerabilities that span multiple institutions. Soteria uses a set of machine learning techniques to detect future attacks, predict their future targets, and rank attacks based on their predicted severity. Our evaluation with real data from Canada-wide institutional networks shows that Soteria can predict future attacks with a 95% recall rate, predict the next targets of an attack with a 97% recall rate, and detect attacks in the first 20% of their life span. Soteria is deployed in production at the CANARIE Canada-wide network that connects tens of Canadian academic institutions.
A Study of the Capabilities of Message-Oriented Middleware Systems
(University of Waterloo, 2023-01-27) Al-Manasrah, Wael; Al-Kiswany, Samer
We present a comprehensive characterization study of open-source Message-Oriented Middleware (MOM) systems. We devised a rigorous methodology to select and study 10 popular and diverse MOM systems. For each system, we examine 42 features with a total of 134 different options. We found that MOM systems have evolved to provide a framework for modern cloud applications through high flexibility and configurability and by offering core building blocks for complex applications including transaction support, active messaging, resource management, flow control, and native support for multi-tenancy. A key result of our study is that we believe there is an opportunity for the community to consolidate its efforts on fewer open-source projects. We have also created an annotated data set that makes it easy to verify our findings and that can help practitioners and developers determine and understand the features of different systems. For a wider impact, our data set is publicly available at [https://docs.google.com/spreadsheets/d/1HrZ7ub19FuuBzA5z4aA6RfR5vnkdnm0bg3hxfADspEA/edit?usp=sharing].

Browsing Mathematics (Faculty of) by Author "Al-Kiswany, Samer"