Show simple item record

dc.contributor.authorWhissell, Johnen 14:01:03 (GMT) 14:01:03 (GMT)
dc.description.abstractIn this thesis, we present a new clustering algorithm we call <em>Significance Feature Clustering</em>, which is designed to cluster text documents. Its central premise is the mapping of raw frequency count vectors to discrete-valued significance vectors which contain values of -1, 0, or 1. These values represent whether a word is <em>significantly negative</em>, <em>neutral</em>, or <em>significantly positive</em>, respectively. Initially, standard tf-idf vectors are computed from raw frequency vectors, then these tf-idf vectors are transformed to significance vectors using a parameter alpha, where alpha controls the mapping -1, 0, or 1 for each vector entry. SFC clusters agglomeratively, with each document's significance vector representing a cluster of size one containing just the document, and iteratively merges the two clusters that exhibit the most similar average using cosine similarity. We show that by using a good alpha value, the significance vectors produced by SFC provide an accurate indication of which words are significant to which documents, as well as the type of significance, and therefore correspondingly yield a good clustering in terms of a well-known definition of clustering quality. We further demonstrate that a user need not manually select an alpha as we develop a new definition of clustering quality that is highly correlated with text clustering quality. Our metric extends the family of metrics known as <em>internal similarity</em>, so that it can be applied to a tree of clusters rather than a set, but it also factors in an aspect of recall that was absent from previous internal similarity metrics. Using this new definition of internal similarity, which we call <em>maximum tree internal similarity</em>, we show that a close to optimal text clustering may be picked from any number of clusterings created by different alpha's. The automatically selected clusterings have qualities that are close to that of a well-known and powerful hierarchical clustering algorithm.en
dc.format.extent924191 bytes
dc.publisherUniversity of Waterlooen
dc.rightsCopyright: 2006, Whissell, John. All rights reserved.en
dc.subjectComputer Scienceen
dc.subjecttf-idf vectorsen
dc.subjectdata representationsen
dc.titleSignificant Feature Clusteringen
dc.typeMaster Thesisen
dc.pendingfalseen of Computer Scienceen
uws-etd.degreeMaster of Mathematicsen

Files in this item


This item appears in the following Collection(s)

Show simple item record


University of Waterloo Library
200 University Avenue West
Waterloo, Ontario, Canada N2L 3G1
519 888 4883

All items in UWSpace are protected by copyright, with all rights reserved.

DSpace software

Service outages