Improving Reliability for Networked Systems and Software Execution

Gu, Haoyu
2026-05-27; 2026-05-22
https://hdl.handle.net/10012/23418

Reliability is a fundamental requirement for modern software systems and services. As these systems grow larger and serve more users, even minor failures or outages can escalate into critical incidents. Reliability is a broad concept that spans the design of many kinds of systems. Within this scope, three areas still lack good solutions: connection failures in layer 4–7 network functions, software bug triage and diagnosis, and reliably reproducible software executions. This thesis introduces HA/TCP, AutoPecker, and PerfCheck to address each of these problems in turn.

HA/TCP improves the reliability of networked systems. HA/TCP is the first framework to support the migration and failover of TCP-based layer 7 network functions (NFs) for reliability and multi-node scalability. HA/TCP does not modify the TCP protocol, allowing existing projects to take advantage of it without client changes. HA/TCP actively replicates traffic from the primary node to all replica nodes to keep their state in sync. In the case of a node failure, HA/TCP enables replica NFs to take over connections in microseconds. Moreover, HA/TCP is completely transparent to the client: connection migration and failover are not visible to it.

AutoPecker provides a solution for automatic software bug triage and diagnosis. AutoPecker achieves the best of both worlds by combining a low-overhead record/replay system with customized sanitizers and other instrumentation that can be enabled on replay. On a program crash, or when invoked manually by the user, AutoPecker captures a trace of the program execution and tests it against a suite of sanitizers and programmer assertions. AutoPecker can run on the user’s or developer’s machine to automatically triage the bug and provide a detailed analysis.

PerfCheck provides a comprehensive solution for improving the reliability and reproducibility of software execution.
PerfCheck collects configurations and specifications of the host system, allowing developers or researchers to share their project together with a PerfCheck report as a configuration baseline. When other researchers attempt to reproduce the execution results, PerfCheck allows them to inspect and identify differences in their local environment configurations, ensuring reliable reproduction.

en
operating systems; reliability; middleboxes; networking; programming languages; debugging; reproducibility
Doctoral Thesis