|dc.description.abstract||Data analytics is being widely used not only as a business tool, which empowers organizations to drive efficiencies, glean deeper operational insights and identify new opportunities, but also for the greater good of society, as it is helping solve some of world's most pressing issues, such as developing COVID-19 vaccines, fighting poverty and climate change. Data analytics is a process involving a pipeline of tasks over the underlying datasets, such as data acquisition and cleaning, data exploration and profiling, building statistics and training machine learning models. In many cases, conducting data analytics faces two practical challenges. First, many sensitive datasets have restricted access and do not allow unfettered access; Second, data assets are often owned and stored in silos by multiple business units within an organization with different access control. Therefore, data scientists have to do analytics on private and siloed data.
There is a fundamental trade-off between data privacy and the data analytics tasks. On the one hand, achieving good quality data analytics requires understanding the whole picture of the data; on the other hand, despite recent advances in designing privacy and security primitives such as differential privacy and secure computation, when naivly applied, they often significantly downgrade tasks' efficiency and accuracy, due to the expensive computations and injected noise, respectively. Moreover, those techniques are often piecemeal and they fall short in holistically integrating into end-to-end data analytics tasks.
In this thesis, we approach this problem by treating privacy and utility as constraints on data analytics. First, we study each task and express its utility as data constraints; then, we select a principled data privacy and security model for each task; and finally, we develop mechanisms to combine them into end to end analytics tasks. This dissertation addresses the specific technical challenges of trading off privacy and utility in three popular analytics tasks.
The first challenge is to ensure query accuracy in private data exploration. Current systems for answering queries with differential privacy place an inordinate burden on the data scientist to understand differential privacy, manage their privacy budget, and even implement new algorithms for noisy query answering. Moreover, current systems do not provide any guarantees to the data analyst on the quality they care about, namely accuracy of query answers. We propose APEx, a generic accuracy-aware privacy query engine for private data exploration. The key distinction of APEx is to allow the data scientist to explicitly specify the desired accuracy bounds to a SQL query. Using experiments with query benchmarks and a case study, we show that APEx allows high exploration quality with a reasonable privacy loss.
The second challenge is to preserve the structure of the data in private data synthesis. Existing differentially private data synthesis methods aim to generate useful data based on applications, but they fail in keeping one of the most fundamental data properties of the structured data — the underlying correlations and dependencies among tuples and attributes. As a result, the synthesized data is not useful for any downstream tasks that require this structure to be preserved. We propose Kamino, a data synthesis system to ensure differential privacy and to preserve the structure and correlations present in the original dataset. We empirically show that while preserving the structure of the data, Kamino achieves comparable and even better usefulness in applications of training classification models and answering marginal queries than the state-of-the-art methods of differentially private data synthesis.
The third challenge is efficient and secure private data profiling. Discovering functional dependencies (FDs) usually requires access to all data partitions to find constraints that hold on the whole dataset. Simply applying general secure multi-party computation protocols incurs high computation and communication cost. We propose SMFD to formulate the FD discovery problem in the secure multi-party scenario, and design secure and efficient cryptographic protocols to discover FDs over distributed partitions. Experimental results show that SMFD is practically efficient over non-secure distributed FD discovery, and can significantly outperform general purpose multi-party computation frameworks||en