Improving Reliability for Networked Systems and Software Execution

dc.contributor.author: Gu, Haoyu
dc.date.accessioned: 2026-05-27T19:37:02Z
dc.date.available: 2026-05-27T19:37:02Z
dc.date.issued: 2026-05-27
dc.date.submitted: 2026-05-22
dc.description.abstract: Reliability is a fundamental requirement for modern software systems and services. As these systems grow larger and serve more users, even minor failures or outages can escalate into critical incidents. Reliability is a broad concept that spans the design of many kinds of systems. Within it, three areas still lack good solutions: connection failures in layer 4–7 network functions, software bug triage and diagnosis, and reliably reproducible software executions. This thesis introduces HA/TCP, AutoPecker, and PerfCheck to address each of these problems in turn. HA/TCP improves the reliability of networked systems. It is the first framework to support the migration and failover of TCP-based layer 7 network functions (NFs) for reliability and multi-node scalability. HA/TCP does not modify the TCP protocol, so existing projects can take advantage of it without client changes. HA/TCP actively replicates traffic from the primary node to all replica nodes to keep their state in sync. When a node fails, HA/TCP enables replica NFs to take over connections in microseconds. Moreover, HA/TCP is completely transparent to the client: connection migration and failover are not visible to it. AutoPecker provides a solution for automatic software bug triage and diagnosis. AutoPecker achieves the best of both worlds by combining a low-overhead record/replay system with customized sanitizers and other instrumentation that can be enabled on replay. On a program crash, or when invoked manually by the user, AutoPecker captures a trace of the program execution and tests it against a suite of sanitizers and programmer assertions. AutoPecker can run on the user’s or developer’s machine to automatically triage the bug and provide a detailed analysis. PerfCheck provides a comprehensive solution for improving the reliability and reproducibility of software execution. PerfCheck collects the configurations and specifications of the host system, allowing developers or researchers to share their project together with a PerfCheck report as a configuration baseline. When other researchers attempt to reproduce the execution results, PerfCheck lets them inspect and identify differences in their local environment configurations, supporting reliable reproduction.
dc.identifier.uri: https://hdl.handle.net/10012/23418
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo (en)
dc.subject: operating systems
dc.subject: reliability
dc.subject: middleboxes
dc.subject: networking
dc.subject: programming languages
dc.subject: debugging
dc.subject: reproducibility
dc.title: Improving Reliability for Networked Systems and Software Execution
dc.type: Doctoral Thesis
uws-etd.degree: Doctor of Philosophy
uws-etd.degree.department: David R. Cheriton School of Computer Science
uws-etd.degree.discipline: Computer Science
uws-etd.degree.grantor: University of Waterloo (en)
uws-etd.embargo.terms: 0
uws.contributor.advisor: Mashtizadeh, Ali José
uws.contributor.affiliation1: Faculty of Mathematics
uws.peerReviewStatus: Unreviewed (en)
uws.published.city: Waterloo (en)
uws.published.country: Canada (en)
uws.published.province: Ontario (en)
uws.scholarLevel: Graduate (en)
uws.typeOfResource: Text (en)

Files

Original bundle

Name: Gu_Haoyu.pdf
Size: 4.71 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission