Theoretical foundations for efficient clustering

dc.contributor.authorKushagra, Shrinu
dc.date.accessioned2019-06-07T20:21:45Z
dc.date.available2019-06-07T20:21:45Z
dc.date.issued2019-06-07
dc.date.submitted2019-06-05
dc.description.abstractClustering aims to group together data instances which are similar while simultaneously separating the dissimilar instances. The task of clustering is challenging due to many factors. The most well-studied is the high computational cost. The clustering task can be viewed as an optimization problem where the goal is to minimize a certain cost function (like k-means cost or k-median cost). Not only are the minimization problems NP-hard but often also NP-hard to approximate (within a constant factor). There are two other major issues in clustering, namely under-specificity and noise-robustness. The focus of this thesis is tackling these two issues while simultaneously ensuring low computational cost. Clustering is an under-specified task. The same dataset may need to be clustered in different ways depending upon the intended application. Different solution requirements need different approaches. In such situations, domain knowledge is needed to better define the clustering problem. We incorporate this by allowing the clustering algorithm to interact with an oracle by asking whether two points belong to the same or different cluster. In a preliminary work, we show that access to a small number of same-cluster queries makes an otherwise NP-hard k-means clustering problem computationally tractable. Next, we consider the problem of clustering for data de-duplication; detecting records which correspond to the same physical entity in a database. We propose a correlation clustering like framework to model such record de-duplication problems. We show that access to a small number of same-cluster queries can help us solve the 'restricted' version of correlation clustering. Rather surprisingly, more relaxed versions of correlation clustering are intractable even when allowed to make a 'large' number of same-cluster queries. Next, we explore the issue of noise-robustness of clustering algorithms. Many real-world datasets, have on top of cohesive subsets, a significant amount of points which are `unstructured'. The addition of these noisy points makes it difficult to detect the structure of the remaining points. In the first line of work, we define noise as not having significantly large dense subsets. We provide computationally efficient clustering algorithms that capture all meaningful clusterings of the dataset; where the clusters are cohesive (defined formally by notions of clusterability) and where the noise satisfies the gray background assumption. We complement our results by showing that when either the notions of structure or the noise requirements are relaxed, no such results are possible. In the second line of work, we develop a generic procedure that can transform objective-based clustering algorithms into one that is robust to outliers (as long the number of such points is not 'too large'). In particular, we develop efficient noise-robust versions of two common clustering algorithms and prove robustness guarantees for them.en
dc.identifier.urihttp://hdl.handle.net/10012/14747
dc.language.isoenen
dc.pendingfalse
dc.publisherUniversity of Waterlooen
dc.subjectClusteringen
dc.subjectSame-cluster queriesen
dc.subjectData De-duplicationen
dc.subjectCorrelation clusteringen
dc.subjectNoise-robust clusteringen
dc.subjectSparse noiseen
dc.subjectRegularized clusteringen
dc.subjectSemi-definite programen
dc.titleTheoretical foundations for efficient clusteringen
dc.typeDoctoral Thesisen
uws-etd.degreeDoctor of Philosophyen
uws-etd.degree.departmentDavid R. Cheriton School of Computer Scienceen
uws-etd.degree.disciplineComputer Scienceen
uws-etd.degree.grantorUniversity of Waterlooen
uws.contributor.advisorBen-David, Shai
uws.contributor.affiliation1Faculty of Mathematicsen
uws.peerReviewStatusUnrevieweden
uws.published.cityWaterlooen
uws.published.countryCanadaen
uws.published.provinceOntarioen
uws.scholarLevelGraduateen
uws.typeOfResourceTexten

Files

Original bundle

Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
Kushagra_Shrinu.pdf
Size:
1.23 MB
Format:
Adobe Portable Document Format
Description:

License bundle

Now showing 1 - 1 of 1
No Thumbnail Available
Name:
license.txt
Size:
6.08 KB
Format:
Item-specific license agreed upon to submission
Description: