dc.contributor.author: Zokaei Ashtiani, Mohammad
dc.description.abstract: In the absence of domain knowledge, clustering is usually an under-specified task. For any clustering application, one can choose among a variety of clustering algorithms, along with different preprocessing techniques, that are likely to produce dramatically different answers. Any of these solutions, however, can be acceptable depending on the application, and it is therefore critical to incorporate prior knowledge about the data and the intended semantics of clustering into the process of clustering model selection. One scenario that we study is when the user (i.e., the domain expert) provides a clustering of a (relatively small) random subset of the data set. The clustering algorithm then uses this kind of "advice" to come up with a data representation under which an application of a fixed clustering algorithm (e.g., k-means) results in a partition of the full data set that is aligned with the user's knowledge. We analyze the "advice complexity" of learning a representation in this paradigm. Another form of "advice" can be obtained by allowing the clustering algorithm to interact with a domain expert by asking same-cluster queries: "Do these two instances belong to the same cluster?" The goal of the clustering algorithm is then to find a partition of the data set that is consistent with the domain expert's knowledge while using only a small number of queries. Aside from studying the "advice complexity" (i.e., query complexity) of learning in this model, we investigate the trade-offs between the computational and advice complexities of learning, showing that a little advice can turn an otherwise computationally hard clustering problem into a tractable one. In the second part of this dissertation we study the problem of learning mixture models, where we are given an i.i.d. sample generated from an unknown target in a family of mixture distributions and want to output a distribution that is close to the target in total variation distance. In particular, given a sample-efficient learner for a base class of distributions (e.g., Gaussians), we show how one can come up with a sample-efficient method for learning mixtures of the base class (e.g., mixtures of k Gaussians). As a byproduct of this analysis, we are able to prove tighter sample complexity bounds for learning various mixture models. We also investigate how access to same-cluster queries (i.e., whether two instances were generated from the same mixture component) can help reduce the computational burden of learning within this model. Finally, we take a further step and introduce a novel method for distribution learning via a form of compression. In particular, we ask whether one can compress a large-enough sample set generated from a target distribution (by picking only a few instances from it) in a way that allows recovery of (an approximation to) the target distribution. We prove that if this is the case for all members of a class of distributions, then there is a sample-efficient way of distribution learning with respect to this class. As an application of this novel notion, we settle the sample complexity of learning mixtures of k axis-aligned Gaussian distributions (within logarithmic factors). (en)
dc.publisher: University of Waterloo (en)
dc.subject: mixture models (en)
dc.subject: interactive clustering (en)
dc.subject: density estimation (en)
dc.subject: sample compression (en)
dc.title: A PAC-Theory of Clustering with Advice (en)
dc.type: Doctoral Thesis (en)
dc.pending: false
R. Cheriton School of Computer Science (en)
Science (en)
of Waterloo (en)
uws-etd.degree: Doctor of Philosophy (en)
uws.contributor.advisor: Ben-David, Shai
uws.contributor.affiliation1: Faculty of Mathematics (en)

University of Waterloo Library
200 University Avenue West
Waterloo, Ontario, Canada N2L 3G1
519 888 4883

All items in UWSpace are protected by copyright, with all rights reserved.
