A Fault Injection Tool for Testing Distributed System with Network Faults
Date
2024-01-30
Authors
Khaleel, Seba
Advisor
Al-Kiswany, Samer
Publisher
University of Waterloo
Abstract
Modern systems are complex: they include hundreds of components that implement complex protocols such as scheduling, replication, membership, resource management, client access, and security. These systems are expected to offer high availability and to preserve the data stored in them despite environment faults. Testing is the primary approach for improving system reliability. Testing against environment faults such as hardware failures, memory corruption, and network problems is complicated because such faults can happen at any time in the system's lifetime, at any component, and at any step in a complex protocol.
In this work, we focus on testing for network partitioning faults.
We build PPATT, a fault injection testing tool that injects network partition faults between components. To reduce the number of test scenarios that need to be considered, we implement two techniques that focus testing on components that communicate during an operation. We verify the tool by reproducing four catastrophic failures from two widely used systems: Spark and Kafka. To demonstrate the benefits of our tool, we test three systems using PPATT: Flink, Hazelcast, and ActiveMQ Artemis. Our testing discovers three failures in these systems, all of which are due to design flaws.
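The abstract does not describe PPATT's two techniques in detail, so the following is only an illustrative sketch of the general idea of pruning partition scenarios to those that separate components that actually communicate. It is not PPATT's implementation. The function name `partition_scenarios`, the component names, and the observed communication pairs are all hypothetical, and the sketch assumes a complete partition that splits the cluster into two groups.

```python
from itertools import combinations


def partition_scenarios(components, comm_pairs):
    """Yield two-way splits (side_a, side_b) that cut at least one observed link.

    components: iterable of component names (hypothetical).
    comm_pairs: iterable of (a, b) pairs seen communicating during an operation.
    """
    nodes = sorted(set(components))
    # Ignore self-links; they can never be cut by a partition.
    links = {frozenset(pair) for pair in comm_pairs if pair[0] != pair[1]}
    if len(nodes) < 2:
        return
    first, rest = nodes[0], nodes[1:]
    # Pin the first node to side_a so mirrored splits are not produced twice.
    # k stops before len(rest) so that side_b is never empty.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            side_a = frozenset((first,) + extra)
            side_b = frozenset(nodes) - side_a
            # A link is cut when exactly one of its endpoints is in side_a.
            if any(len(link & side_a) == 1 for link in links):
                yield side_a, side_b


# Hypothetical example: "monitor" exchanges no messages during the operation.
components = ["client", "leader", "follower", "monitor"]
comm_pairs = [("client", "leader"), ("leader", "follower")]
scenarios = list(partition_scenarios(components, comm_pairs))
for side_a, side_b in scenarios:
    print(sorted(side_a), "|", sorted(side_b))
```

In this hypothetical example, the only split that is skipped is the one that isolates `monitor` alone, since it separates no communicating pair. Real savings would depend on how sparse the communication during an operation is.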
Description
Keywords
Network Partitions, Fault Injection, Distributed Systems