Evaluating Clusterings by Estimating Clarity
Abstract
In this thesis I examine clustering evaluation, with a particular focus on text clusterings. The principal work of this thesis is the development, analysis, and testing of a new internal clustering quality measure called informativeness.
I begin by reviewing clustering in general. I then review current clustering quality measures, accompanying this with an in-depth discussion of the important properties one needs to understand about such measures. This is followed by extensive document clustering experiments that show problems with standard clustering evaluation practices.
I then develop informativeness, my new internal clustering quality measure for estimating the clarity of clusterings. I show that informativeness, which uses classification accuracy as a proxy for human assessment of clusterings, is both theoretically sensible and empirically effective. I present a generalization of informativeness that leverages external clustering quality measures. I also show its use in a realistic application: email spam filtering. I show that informativeness can be used to select clusterings that lead to superior spam filters when few true labels are available.
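The idea of using classification accuracy as a proxy for how clear a clustering is can be illustrated with a minimal sketch. This is not the thesis's definition of informativeness; it only assumes that a clustering is "clear" when a simple classifier, trained on some points and their cluster labels, can predict the cluster labels of held-out points. The function names `nearest_centroid_predict` and `clarity_proxy`, the nearest-centroid classifier, and the strided fold split are all assumptions chosen for illustration.

```python
from collections import defaultdict


def nearest_centroid_predict(train_x, train_y, point):
    """Predict a label for `point` as the label of the closest class centroid."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(train_x, train_y):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1

    best_label, best_dist = None, float("inf")
    for label, total in sums.items():
        centroid = [s / counts[label] for s in total]
        # Squared Euclidean distance is enough for comparing centroids.
        dist = sum((c - v) ** 2 for c, v in zip(centroid, point))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def clarity_proxy(points, cluster_labels, folds=2):
    """Hypothetical proxy: cross-validated accuracy of predicting cluster labels.

    Each point is held out exactly once; the returned value is the fraction
    of held-out points whose cluster label the classifier recovers.
    """
    n = len(points)
    correct = 0
    for fold in range(folds):
        test_idx = [i for i in range(n) if i % folds == fold]
        train_idx = [i for i in range(n) if i % folds != fold]
        train_x = [points[i] for i in train_idx]
        train_y = [cluster_labels[i] for i in train_idx]
        for i in test_idx:
            predicted = nearest_centroid_predict(train_x, train_y, points[i])
            if predicted == cluster_labels[i]:
                correct += 1
    return correct / n


# Two well-separated groups of points (illustrative data only).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
clean = [0, 0, 0, 0, 1, 1, 1, 1]      # clustering that follows the groups
muddled = [0, 1, 0, 1, 0, 1, 0, 1]    # clustering that ignores the groups

print(clarity_proxy(points, clean), clarity_proxy(points, muddled))
```

Under these assumptions, the clustering that follows the two groups scores higher than the one that ignores them, which is the behavior one would want when ranking candidate clusterings without true labels.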
I conclude this thesis with a discussion of clustering evaluation in general, informativeness, and the directions I believe clustering evaluation research should take in the future.
Cite this version of the work
John Whissell (2012). Evaluating Clusterings by Estimating Clarity. UWSpace. http://hdl.handle.net/10012/7103
Related items
Showing items related by title, author, creator and subject.
- Theoretical foundations for efficient clustering
  Kushagra, Shrinu (University of Waterloo, 2019-06-07) Clustering aims to group together data instances which are similar while simultaneously separating the dissimilar instances. The task of clustering is challenging due to many factors. The most well-studied is the high ...
- Approximation Algorithms for Clustering and Facility Location Problems
  Ahmadian, Sara (University of Waterloo, 2017-04-06) Facility location problems arise in a wide range of applications such as plant or warehouse location problems, cache placement problems, and network design problems, and have been widely studied in Computer Science and ...
- Discovery and Analysis of Aligned Pattern Clusters from Protein Family Sequences
  Lee, En-Shiun Annie (University of Waterloo, 2014-04-28) Protein sequences are essential for encoding molecular structures and functions. Consequently, biologists invest substantial resources and time discovering functional patterns in proteins. Using high-throughput technologies, ...