Clustering Dependencies over Relational Tables
Integrity constraints have proven valuable in the database field. They help with schema design (functional dependencies, FDs), query optimization (ordering dependencies, ODs), and data cleaning (conditional functional dependencies, CFDs, and denial constraints, DCs). In this thesis, we introduce a new type of integrity constraint, called a clustering dependency (CD). Just as ordering dependencies are built on the database operation ORDER BY, clustering dependencies are built on the operation GROUP BY. We claim that clustering dependencies are useful not only in query optimization, as most integrity constraints are, but also in data visualization, data analysis, and MapReduce. We first present examples of clustering dependencies in a real-life dataset. We then formally define clustering dependencies and elaborate on our motivation. Next, we study the reasoning system for clustering dependencies, including the implication problem, the consistency problem, and inference rules. After that, we propose two algorithms for clustering dependencies. The first is a checking algorithm that decides whether a given dependency holds in a table in O(N*M) time, where N is the number of rows and M is the size of the potentially aggregated attributes, i.e., the right-hand-side attributes. The second is a mining algorithm that discovers all clustering dependencies that hold in a table. Finally, we use both synthetic and real-life data to evaluate the performance of our mining algorithm.