A Fault Injection Tool for Testing Distributed System with Network Faults

Date

2024-01-30

Authors

Khaleel, Seba

Advisor

Al-Kiswany, Samer

Publisher

University of Waterloo

Abstract

Modern systems are complex: they include hundreds of components that implement complex protocols such as scheduling, replication, membership, resource management, client access, and security. These systems are expected to offer high availability and to preserve the data stored in them despite environment faults. Testing is the primary approach for improving system reliability. Testing against environment faults such as hardware failures, memory corruption, and network problems is complicated because such faults can happen at any time in the system's lifetime, at any component, and at any step in a complex protocol. In this work, we focus on testing for network partitioning faults. We build PPATT, a fault injection testing tool that injects network partition faults between components. To reduce the number of test scenarios that need to be considered, we implement two techniques that focus testing on components that communicate during an operation. We verify the tool by reproducing four catastrophic failures from two widely used systems: Spark and Kafka. To demonstrate the benefits of our tool, we test three systems using PPATT: Flink, Hazelcast, and ActiveMQ Artemis. Our testing discovers three failures in these systems, all of which are due to design flaws.

Keywords

Network Partitions, Fault Injection, Distributed Systems