Improving Reliability for Networked Systems and Software Execution
| dc.contributor.author | Gu, Haoyu | |
| dc.date.accessioned | 2026-05-27T19:37:02Z | |
| dc.date.available | 2026-05-27T19:37:02Z | |
| dc.date.issued | 2026-05-27 | |
| dc.date.submitted | 2026-05-22 | |
| dc.description.abstract | Reliability is a fundamental requirement for modern software systems and services. As these systems grow larger and serve more users, even minor failures or outages can escalate into critical incidents. Reliability is a broad concept that spans the design of many kinds of systems. Within that scope, three areas still lack good solutions: connection failures in layer 4–7 network functions, software bug triage and diagnosis, and reliably reproducible software execution. This thesis introduces HA/TCP, AutoPecker, and PerfCheck to address each of these problems in turn. HA/TCP improves the reliability of networked systems. It is the first framework to support the migration and failover of TCP-based layer 7 network functions (NFs) for reliability and multi-node scalability. HA/TCP does not modify the TCP protocol, allowing existing projects to take advantage of it without client changes. It actively replicates traffic from the primary node to all replica nodes to keep their state in sync. In the case of a node failure, HA/TCP enables replica NFs to take over connections in microseconds. Moreover, HA/TCP is completely transparent to the client, so connection migration and failover are not visible to it. AutoPecker provides a solution for automatic software bug triage and diagnosis. AutoPecker achieves the best of both worlds by combining a low-overhead record/replay system with customized sanitizers and other instrumentation that can be enabled on replay. On a program crash, or when manually invoked by the user, AutoPecker captures a trace of the program execution and tests it against a suite of sanitizers and programmer assertions. AutoPecker can run on the user's or developer's machine to automatically triage the bug and provide a detailed analysis. PerfCheck provides a comprehensive solution for improving the reliability and reproducibility of software execution. PerfCheck collects configurations and specifications of the host system, allowing developers or researchers to share their project together with a PerfCheck report as a configuration baseline. When other researchers attempt to reproduce the execution results, PerfCheck allows them to inspect and identify differences in their local environment configurations, ensuring reliable reproduction. | |
| dc.identifier.uri | https://hdl.handle.net/10012/23418 | |
| dc.language.iso | en | |
| dc.pending | false | |
| dc.publisher | University of Waterloo | en |
| dc.subject | operating systems | |
| dc.subject | reliability | |
| dc.subject | middleboxes | |
| dc.subject | networking | |
| dc.subject | programming languages | |
| dc.subject | debugging | |
| dc.subject | reproducibility | |
| dc.title | Improving Reliability for Networked Systems and Software Execution | |
| dc.type | Doctoral Thesis | |
| uws-etd.degree | Doctor of Philosophy | |
| uws-etd.degree.department | David R. Cheriton School of Computer Science | |
| uws-etd.degree.discipline | Computer Science | |
| uws-etd.degree.grantor | University of Waterloo | en |
| uws-etd.embargo.terms | 0 | |
| uws.contributor.advisor | Mashtizadeh, Ali José | |
| uws.contributor.affiliation1 | Faculty of Mathematics | |
| uws.peerReviewStatus | Unreviewed | en |
| uws.published.city | Waterloo | en |
| uws.published.country | Canada | en |
| uws.published.province | Ontario | en |
| uws.scholarLevel | Graduate | en |
| uws.typeOfResource | Text | en |