Clustering Dependencies over Relational Tables

dc.contributor.advisor: Weddell, Grant
dc.contributor.advisor: Toman, David
dc.contributor.author: Gao, Yuchen
dc.date.accessioned: 2016-01-22T21:20:59Z
dc.date.available: 2016-01-22T21:20:59Z
dc.date.issued: 2016-01-22
dc.date.submitted: 2016-01-18
dc.description.abstract: Integrity constraints have proven valuable in the database field. They help with schema design (functional dependencies, FDs [1][2]), query optimization (ordering dependencies, ODs [4][5][8][9]), and data cleaning (conditional functional dependencies, CFDs [12], and denial constraints, DCs [14]). In this thesis, we introduce a new type of integrity constraint, called a clustering dependency (CD). Just as ordering dependencies rely on the database operation ORDER BY, clustering dependencies focus on the operation GROUP BY. Furthermore, we claim that clustering dependencies are useful not only in query optimization, as most integrity constraints are, but also in data visualization, data analysis, and MapReduce. We first present examples of clustering dependencies in a real-life dataset. We then formally define clustering dependencies and elaborate on our motivation. We also examine the reasoning system for clustering dependencies, including the implication problem, the consistency problem, and inference rules. After that, we propose two algorithms. The first is a checking algorithm that determines whether a given dependency is valid in a table in O(N*M) time, where N is the number of rows and M is the size of the potentially aggregated attributes, that is, the size of the right-hand-side attributes. The second is a mining algorithm that discovers all potential clustering dependencies occurring in a table. Finally, we use both synthetic and real-life data to test the performance of our mining algorithm.
dc.identifier.uri: http://hdl.handle.net/10012/10219
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.subject: Integrity Constraints
dc.subject: Database
dc.subject: Data Visualization
dc.subject: Query Optimization
dc.title: Clustering Dependencies over Relational Tables
dc.type: Master Thesis
uws-etd.degree: Master of Mathematics
uws-etd.degree.department: David R. Cheriton School of Computer Science
uws-etd.degree.discipline: Computer Science
uws-etd.degree.grantor: University of Waterloo
uws.contributor.advisor: Weddell, Grant
uws.contributor.advisor: Toman, David
uws.contributor.affiliation1: Faculty of Mathematics
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text

Files

Original bundle

Name: GAO_YUCHEN.pdf
Size: 1.3 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.17 KB
Format: Item-specific license agreed upon to submission