Analytics for Everyone
Date
2018-05-23
Authors
El Gebaly, Kareem
Advisor
Lin, Jimmy
Aboulnaga, Ashraf
Golab, Lukasz
Publisher
University of Waterloo
Abstract
Analyzing relational data typically involves tasks that facilitate gaining familiarity or insights
and coming up with findings or conclusions based on the data. This process is usually practiced
by data experts, such as data scientists, who share their output with a potentially less expert
audience (everyone). Our goal is to enable everyone to participate in analyzing data rather than
passively consuming its outputs (analytics democratization). With today’s increasing availability
of data (data democratization) on the internet (web) combined with already widespread personal
computing capabilities, such a goal is becoming more attainable. With the recent increase of
public data, i.e., Open Data, users without a technical background are keener than ever to analyze
new data sets that are relevant to wide sectors of society. An important example of Open Data is
the data released by governments all over the world, i.e., Open Government.
This dissertation focuses on two main challenges that arise in data exploration scenarios
such as exploring open data found on the web. First, the infrastructure necessary for interactive
data exploration is costly and hard to manage, especially by users who do not have technical
knowledge. Second, the target users need guidance through the data exploration since there are
too many starting points.
To eliminate challenges related to managing infrastructure, we propose an in-browser SQL
engine (serverless), i.e., a portable database, which we call Afterburner. Afterburner achieves
comparable performance to native SQL engines given the same resources on modestly sized data
sets. Afterburner uses code generation techniques that target an optimization-amenable subset
of JavaScript and employs typed arrays for its columnar-based in-memory storage. In addition,
for databases that are too large for the browser, we propose a hybrid architecture to accelerate
the performance of data exploration tasks: a one-time SQL query runs at the backend, and
subsequent SQL queries run in the browser in response to the user's interactions. Based on a simple
hint from the user, Afterburner automatically splits queries into two parts: a backend query that
generates a materialized view that is shipped to the browser, and a frontend query that runs locally
against this view for each subsequent interaction. Optimizing queries using local materialized views
inside the browser reduces query latency without adding any complexity to the backend or the frontend.
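To make the combination of columnar typed-array storage and code generation more concrete, the following is a minimal sketch and not Afterburner's actual implementation. The `Table` type, the `lineitem` data, and the `compileSumWhereLess` function are hypothetical names introduced here for illustration. The sketch stores each column in a typed array and generates the source of a tight loop for one fixed query shape, which is then compiled with the standard `Function` constructor.

```typescript
// Hypothetical column store: each column lives in its own typed array.
type Table = { rows: number; cols: Record<string, Int32Array | Float64Array> };

// Illustrative data only; not taken from the dissertation.
const lineitem: Table = {
  rows: 4,
  cols: {
    quantity: new Int32Array([5, 20, 30, 8]),
    price: new Float64Array([10.0, 2.5, 1.0, 4.0]),
  },
};

// Generate and compile a loop for the query shape:
//   SELECT SUM(<sumCol>) FROM t WHERE <filterCol> < <bound>
// Column names and the constant are baked into the generated source,
// so the compiled function contains no per-row interpretation overhead.
function compileSumWhereLess(
  sumCol: string,
  filterCol: string,
  bound: number
): (t: Table) => number {
  const src = `
    const s = t.cols[${JSON.stringify(sumCol)}];
    const f = t.cols[${JSON.stringify(filterCol)}];
    let acc = 0;
    for (let i = 0; i < t.rows; i++) {
      if (f[i] < ${Number(bound)}) acc += s[i];
    }
    return acc;`;
  return new Function("t", src) as (t: Table) => number;
}

// SELECT SUM(price) FROM lineitem WHERE quantity < 25
const query = compileSumWhereLess("price", "quantity", 25);
const result = query(lineitem);
console.log(result);
```

In the hybrid setting described above, a table like `lineitem` would play the role of the materialized view shipped from the backend, and each user interaction would compile and run a new frontend query against it. How the real system represents views and splits queries is not specified here.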
One common theme among many data exploration tasks revolves around navigating the many
different ways to group the data, i.e., exploring the data cube. Thus, to guide the user through data
exploration, we apply an information-theoretic technique, called Explanation Tables, that picks the
most informative parts of the entire data cube of a relational table. We evaluate the
efficiency and effectiveness of a sampling-based technique for generating explanation tables that
achieves comparable quality to an exhaustive technique that considers the entire data cube, with
a significant reduction in the run time. In addition, we introduce optimizations to explanation
tables to fit the modest resources available in the browser without any external dependencies.
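The idea of ranking parts of the data cube by informativeness can be illustrated with a simplified sketch. This is not the dissertation's algorithm: it assumes a binary outcome column, enumerates every pattern exhaustively rather than by sampling, and scores each pattern once against a single baseline (the overall outcome rate) using a KL-divergence-based gain. The `Row` type and the `patternsOf` and `scorePatterns` functions are hypothetical names.

```typescript
type Row = { dims: string[]; outcome: 0 | 1 };
type Stat = { pattern: string[]; count: number; positives: number };
const WILDCARD = "*";

// All cube patterns matching a row: each dimension is kept or wildcarded.
function patternsOf(dims: string[]): string[][] {
  let out: string[][] = [[]];
  for (const d of dims) {
    const next: string[][] = [];
    for (const p of out) {
      next.push([...p, d], [...p, WILDCARD]);
    }
    out = next;
  }
  return out;
}

// x * ln(x / y), with the usual convention that 0 * ln(0 / y) = 0.
function term(x: number, y: number): number {
  return x === 0 ? 0 : x * Math.log(x / y);
}

// Score each pattern by count * KL(observed rate || baseline rate),
// then sort from most to least informative.
function scorePatterns(rows: Row[]) {
  let positives = 0;
  for (const r of rows) positives += r.outcome;
  const baseline = positives / rows.length;

  const stats: Record<string, Stat> = {};
  for (const r of rows) {
    for (const p of patternsOf(r.dims)) {
      const key = JSON.stringify(p);
      let s = stats[key];
      if (!s) {
        s = { pattern: p, count: 0, positives: 0 };
        stats[key] = s;
      }
      s.count += 1;
      s.positives += r.outcome;
    }
  }

  return Object.keys(stats)
    .map((key) => {
      const s = stats[key];
      const rate = s.positives / s.count;
      const gain =
        s.count * (term(rate, baseline) + term(1 - rate, 1 - baseline));
      return { pattern: s.pattern, count: s.count, rate, gain };
    })
    .sort((a, b) => b.gain - a.gain);
}

// Toy input (illustrative): the outcome depends on weather, not on the day.
const rows: Row[] = [
  { dims: ["Mon", "rain"], outcome: 1 },
  { dims: ["Mon", "sun"], outcome: 0 },
  { dims: ["Tue", "rain"], outcome: 1 },
  { dims: ["Tue", "sun"], outcome: 0 },
];
const ranked = scorePatterns(rows);
console.log(ranked.slice(0, 2));
```

On this toy input, the patterns that fix the weather and wildcard the day receive the highest gain, while patterns that fix only the day receive a gain of zero. A fuller treatment would update the estimates after each selected pattern instead of scoring everything against one fixed baseline.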
In this dissertation, we present an SQL engine and a data exploration guidance tool that run entirely in
the browser. We view the techniques and the experiments presented here as a fully functional
and open-sourced proof of viability of our proposal. Our analytical stack is portable and works
entirely in the browser. We show that SQL and exploration guidance can be as accessible as a
web page, which opens the opportunity for more people to analyze data sets. Facilitating data
exploration for everyone is one step closer to analytics democratization, where everyone
can participate in data exploration, not just the experts.
Keywords
analytics, SQL engine, code generation, Column-oriented, explanation tables, data exploration, informative, interpretable, javascript, browser, in-browser, mnemonic, open data, open government, RDBMS