Efficient Structure-aware OLAP Query Processing over Large Property Graphs

Zhang, Yan

Efficient Structure-aware OLAP Query Processing over Large Property Graphs

Files

Zhang_Yan.pdf (18.35 MB)

Date

2017-12-14

Authors

Zhang, Yan

Advisor

Özsu, Tamer

Publisher

University of Waterloo

Abstract

Property graph model is a semantically rich model for real-world applications that represent their data as graphs, e.g., communication networks, social networks, financial transaction networks. On-Line Analytical Processing (OLAP) provides an important tool for data analysis by allowing users to perform data aggregation through different combinations of dimensions. For example, given a Q&A forum dataset, in order to study if there is a correlation between a poster's age and his or her post quality, one may ask what is the average age of users grouped by the post score. Another example is that, in the field of music industry, it may be interesting to ask what total sales of records are with respect to different music companies and years so as to conduct a market activity analysis. Surprisingly, current graph databases do not efficiently support OLAP aggregation queries. In most cases, such queries are transformed to a sequence of join operations, and the system computes everything from scratch. For example, Neo4j, a state-of-art graph database system, processes each OLAP query in two steps. First, it expands the nodes and edges that satisfy the given query constraint. Then it performs the aggregation over all the valid substructures returned from the first step. However, in data warehousing workloads, it is common to have repeated queries from time to time. Computing everything from scratch would be highly inefficient. Materialization and view maintenance techniques developed in traditional RDBMS have proved to be efficient for processing OLAP workloads. Following the generic materialization methodology, in this thesis we develop a structure-aware cuboid caching solution to efficiently support OLAP aggregation queries over property graphs. Structure-aware means that our solution takes both heterogeneous attributes and graph topological information into consideration. The essential idea is to precompute and materialize some views based on statistics of history workload, such that future query processing can be accelerated. We implement a prototype system on top of Neo4j. Empirical studies over real-world property graphs show that, with a reasonable space cost constraint, our solution on average achieves 15-30x speedup over native Neo4j in time efficiency.