Reducing the Latency of Dependent Operations in Large-Scale Geo-Distributed Systems
Date
2021-10-14
Authors
Yan, Xinan
Advisor
Wong, Bernard
Publisher
University of Waterloo
Abstract
Many applications rely on large-scale distributed systems for data management and computation. These distributed systems are complex and built from different networked services. Dependencies between these services can create a chain of dependent network I/O operations that have to be executed sequentially. This can result in high service latencies, especially when the chain consists of inter-datacenter operations.
To address the latency of dependent network I/O operations, this thesis introduces approaches and techniques that reduce the number of operations that must execute sequentially in three types of systems. First, it addresses the high transaction completion time in geo-distributed database systems whose data is sharded and replicated across geographical regions. For a single transaction, most existing systems sequentially execute reads, writes, two-phase commit (2PC), and a replication protocol because of dependencies between these parts. This thesis explores a more restrictive transaction model that breaks these dependencies and allows the different parts to execute in parallel.
Second, dependent network I/O operations also lead to high latency for performing leader-based state machine replication across a wide-area network. Fast Paxos introduces a fast path that bypasses the leader for request ordering. However, when concurrent requests arrive at replicas in different orders, the fast path may fail, and Fast Paxos has to fall back to a slow path. This thesis explores the use of network measurements to establish a global order for requests across replicas, allowing Fast Paxos to be effective for concurrent requests.
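The ordering idea can be illustrated with a minimal Python sketch. It is not the protocol from the thesis; it only assumes that a client stamps each request with a future time derived from its measured one-way delays to the replicas, and that every replica delivers requests in timestamp order rather than arrival order. The names `Request`, `assign_timestamp`, and `Replica` are hypothetical, and clock synchronization, failures, and the Fast Paxos quorum logic are left out.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    # Requests compare by (timestamp, client_id); the payload is ignored.
    timestamp: float
    client_id: str
    payload: str = field(compare=False)


def assign_timestamp(now, measured_delays):
    """Pick a future time by which the request should reach every replica.

    measured_delays maps a replica name to an assumed one-way delay
    estimate, in the same unit as `now`.
    """
    return now + max(measured_delays.values())


class Replica:
    """Buffers requests and releases them in timestamp order."""

    def __init__(self):
        self._pending = []

    def receive(self, request):
        heapq.heappush(self._pending, request)

    def deliverable(self, local_clock):
        """Return, in order, the requests whose timestamp has passed."""
        ready = []
        while self._pending and self._pending[0].timestamp <= local_clock:
            ready.append(heapq.heappop(self._pending))
        return ready


# Two concurrent requests reach the replicas in different arrival orders.
a = Request(assign_timestamp(100.0, {"r1": 10.0, "r2": 35.0}), "c1", "x")
b = Request(assign_timestamp(101.0, {"r1": 30.0, "r2": 12.0}), "c2", "y")

r1, r2 = Replica(), Replica()
r1.receive(a)
r1.receive(b)
r2.receive(b)
r2.receive(a)

order_r1 = [req.payload for req in r1.deliverable(140.0)]
order_r2 = [req.payload for req in r2.deliverable(140.0)]
```

In this sketch both replicas release the requests in the same order even though they arrived differently, which is the condition under which a fast path can succeed for concurrent requests. A request that arrives after its timestamp has passed would still need a fallback, which the sketch does not model.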
Finally, this thesis proposes a general solution to reduce the latency impact of dependent operations in distributed systems through the use of speculative execution. For many systems, domain knowledge can be used to predict an operation’s result and speculatively execute subsequent operations, potentially allowing a chain of dependent operations to execute in parallel. This thesis introduces a framework that provides system-level support for performing speculative network I/O operations.
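As a rough illustration of speculation over a two-step dependency chain, the following Python sketch starts the second operation with a predicted result of the first, then keeps or discards it once the real result arrives. It is not the framework described in the thesis: the functions `lookup_user_region` and `fetch_profile` are hypothetical stand-ins for remote calls, and the sketch assumes the speculated operation has no side effects that would need to be undone.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def lookup_user_region(user_id):
    """Hypothetical first operation, standing in for a slow remote lookup."""
    time.sleep(0.05)
    return "us-east"


def fetch_profile(user_id, region):
    """Hypothetical dependent operation that needs the region as input."""
    time.sleep(0.05)
    return f"profile:{user_id}@{region}"


def speculative_chain(user_id, predicted_region):
    """Run both operations in parallel, guessing the first one's result.

    Returns (profile, prediction_was_correct). Assumes fetch_profile is
    side-effect free, so a mispredicted run can simply be discarded.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(lookup_user_region, user_id)
        speculative = pool.submit(fetch_profile, user_id, predicted_region)

        actual_region = first.result()
        if actual_region == predicted_region:
            # Correct prediction: the dependent result is already in flight.
            return speculative.result(), True

        # Misprediction: ignore the speculative result and re-execute.
        return fetch_profile(user_id, actual_region), False
```

With a correct prediction, for example `speculative_chain("u1", "us-east")`, the two calls overlap instead of running back to back. With a wrong prediction the code falls back to the sequential path, so the final result is unchanged and only the latency benefit is lost.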
These three approaches reduce the number of sequentially performed network I/O operations in different domains. Our performance evaluation shows that they can significantly reduce the latency of critical infrastructure services, allowing these services to be used by latency-sensitive applications.
Keywords
distributed systems, dependent operations, distributed transaction processing, state machine replication, speculative execution