Clustering Dependencies over Relational Tables

Gao, Yuchen

Clustering Dependencies over Relational Tables

Files

GAO_YUCHEN.pdf (1.3 MB)

Date

2016-01-22

Authors

Gao, Yuchen

Advisor

Weddell, Grant
Toman, David

Publisher

University of Waterloo

Abstract

Integrity constraints have proven to be valuable in the database field. Not only can they help schema design (functional dependencies, FDs [1][2]), they can also be used in query optimization (ordering dependencies, ODs [4][5][8][9]), or data cleaning (conditional functional dependencies, CFDs [12] and denial constraints, DCs [14]). In this thesis, however, we will introduce a new type of integrity constraint, called a clustering dependency (CD). Similar to ordering dependencies which rely on the database operation ORDER BY, clustering dependencies focus on studying the operation GROUP BY. Furthermore, we claim that clustering dependencies are useful not only in query optimization as most integrity constraints do, but also useful in data visualization, data analysis and MapReduce. In this thesis, we first introduce some examples of clustering dependencies in a real-life dataset. We then formally define clustering dependencies and elaborate on our motivation. We will also look into the reasoning system for clustering dependencies including the implication problem, consistency problem and influence rules for clustering dependencies. After that, we will propose two algorithms for clustering dependencies, first a checking algorithm that is able to check if a given dependency is valid in a table within O(N*M) time, with N being the number of rows and M being the size of potentially aggregated attributes, a.k.a, the size of the right-hand-side attributes. Secondly, we propose a mining algorithm that is able to discover all potential clustering dependencies occurring in a table. Finally, we will use both synthetic and real-life data to test the performance of our mining algorithm.

Keywords

Integrity Constraints, Database, Data Visualization, Query Optimization

URI

http://hdl.handle.net/10012/10219

Collections

Theses
Computer Science

Full item page

Clustering Dependencies over Relational Tables

Files

Date

Authors

Advisor

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

Description

Keywords

LC Subject Headings

Citation

URI

Collections