Optimizing Differential Computation for Large-Scale Graph Processing

Sahu, Siddhartha

Optimizing Differential Computation for Large-Scale Graph Processing

Files

sahu_siddhartha.pdf (1.58 MB)

Date

2024-04-01

Authors

Sahu, Siddhartha

Advisor

Salihoglu, Semh

Publisher

University of Waterloo

Abstract

Diverse applications spanning areas such as fraud detection, risk assessment, recommendations, and telecommunications process datasets characterized by entities and their relationships. Graphs naturally emerge as the most intuitive abstraction for modeling these datasets. Many practical applications seek the ability to share computations across multiple snapshots of evolving graphs to efficiently perform analyses, such as evaluating changing road conditions in transportation networks or performing contingency analysis on infrastructure networks. The research in this thesis is motivated by the challenge of efficiently supporting such applications on large datasets. Differential computation (DC) has emerged as a powerful general technique for incrementally maintaining computations over evolving datasets, even those containing arbitrarily nested loops. It is thus a promising technique that can be used to build the kinds of applications that motivate this thesis. We present a study of DC that explores how it can be used to build practical data systems. In particular, this thesis addresses two challenges that impede the adoption of DC: (i) the lack of high-level interfaces that can be used to develop graph-specific applications; and (ii) scalability challenges that arise due to the general maintenance technique used by DC, making it less efficient for application-specific workloads. The main contribution of this thesis is to show how DC can be made more practical for graph processing systems. To address the lack of high-level interfaces, we built GraphSurge, a system that can be used to create and analyze multiple views over static graphs using a declarative programming interface. When users perform graph computations on a collection of views, GraphSurge internally uses DC to share the computation across all views. We develop several optimizations that improve the scalability of DC. Within GraphSurge, we identify two optimization problems, which we call collection ordering and collection splitting, and present algorithms to solve these problems. These optimizations improve the runtime of GraphSurge applications by up to an order of magnitude. In the reference implementation of DC, we identify two design bottlenecks in how data is indexed and processed within operators. To address the bottlenecks, we implement a new index and an optimization called Fast Empty Difference Verification, which improves the runtime of graph processing workloads by up to 14x. Our work was informed by insights from a non-technical user survey we conducted to understand how graphs are used in practice.