Improving Reliability for Networked Systems and Software Execution

Gu, Haoyu

Files

Gu_Haoyu.pdf (4.71 MB)

Date

2026-05-27

Authors

Gu, Haoyu

Advisor

Mashtizadeh, Ali José

Publisher

University of Waterloo

Abstract

Reliability is a fundamental requirement for modern software systems and services. As these systems grow larger and serve more users, even minor failures or outages can escalate into critical incidents. Reliability is a broad concern that spans the design of many kinds of systems. Within it, three areas still lack good solutions: connection failures in layer 4–7 network functions, software bug triage and diagnosis, and reliably reproducible software execution. This thesis introduces HA/TCP, AutoPecker, and PerfCheck to address each of these problems in turn.

HA/TCP improves the reliability of networked systems. It is the first framework to support the migration and failover of TCP-based layer 7 network functions (NFs) for reliability and multi-node scalability. HA/TCP does not modify the TCP protocol, so existing projects can adopt it without client changes. It actively replicates traffic from the primary node to all replica nodes to keep their state in sync. When a node fails, HA/TCP enables replica NFs to take over connections in microseconds. Connection migration and failover are completely transparent to the client.

AutoPecker provides a solution for automatic software bug triage and diagnosis. It achieves the best of both worlds by combining a low-overhead record/replay system with customized sanitizers and other instrumentation that can be enabled on replay. On a program crash, or when manually invoked by the user, AutoPecker captures a trace of the program execution and tests it against a suite of sanitizers and programmer assertions. AutoPecker can run on the user's or the developer's machine to triage the bug automatically and provide a detailed analysis.

PerfCheck provides a comprehensive solution for improving the reliability and reproducibility of software execution. It collects the configuration and specifications of the host system, so developers and researchers can share a project together with a PerfCheck report that serves as a configuration baseline. When other researchers attempt to reproduce the results, PerfCheck lets them inspect their local environment and identify where its configuration differs from the baseline, which supports reliable reproduction.
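The baseline-and-compare workflow described for PerfCheck can be illustrated with a minimal sketch. This is not PerfCheck's implementation or interface, which the abstract does not describe; the function names `collect_snapshot` and `diff_snapshots` and the set of collected fields are hypothetical, chosen only to show the idea of recording a host configuration and reporting where a second environment differs.

```python
import os
import platform


def collect_snapshot():
    """Collect a few host properties using only the standard library.

    The fields chosen here are an assumption for illustration; a real
    tool would record far more (kernel settings, library versions, ...).
    """
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "cpu_count": os.cpu_count(),
    }


def diff_snapshots(baseline, local):
    """Return {key: (baseline_value, local_value)} for every differing key.

    A key that is absent on one side is reported with None on that side.
    """
    keys = sorted(set(baseline) | set(local))
    return {
        key: (baseline.get(key), local.get(key))
        for key in keys
        if baseline.get(key) != local.get(key)
    }
```

In this sketch, the author of a project would store the output of `collect_snapshot()` alongside the project, and someone reproducing the work would call `diff_snapshots(shared_baseline, collect_snapshot())` to list the settings that differ on their machine.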