Significant Feature Clustering
MetadataShow full item record
In this thesis, we present a new clustering algorithm we call <em>Significance Feature Clustering</em>, which is designed to cluster text documents. Its central premise is the mapping of raw frequency count vectors to discrete-valued significance vectors which contain values of -1, 0, or 1. These values represent whether a word is <em>significantly negative</em>, <em>neutral</em>, or <em>significantly positive</em>, respectively. Initially, standard tf-idf vectors are computed from raw frequency vectors, then these tf-idf vectors are transformed to significance vectors using a parameter alpha, where alpha controls the mapping -1, 0, or 1 for each vector entry. SFC clusters agglomeratively, with each document's significance vector representing a cluster of size one containing just the document, and iteratively merges the two clusters that exhibit the most similar average using cosine similarity. We show that by using a good alpha value, the significance vectors produced by SFC provide an accurate indication of which words are significant to which documents, as well as the type of significance, and therefore correspondingly yield a good clustering in terms of a well-known definition of clustering quality. We further demonstrate that a user need not manually select an alpha as we develop a new definition of clustering quality that is highly correlated with text clustering quality. Our metric extends the family of metrics known as <em>internal similarity</em>, so that it can be applied to a tree of clusters rather than a set, but it also factors in an aspect of recall that was absent from previous internal similarity metrics. Using this new definition of internal similarity, which we call <em>maximum tree internal similarity</em>, we show that a close to optimal text clustering may be picked from any number of clusterings created by different alpha's. The automatically selected clusterings have qualities that are close to that of a well-known and powerful hierarchical clustering algorithm.