Bayesian Unsupervised Labeling of Web Document Clusters

Liu, Ting

Bayesian Unsupervised Labeling of Web Document Clusters

Files

Liu_Ting.pdf (959.23 KB)

Date

2011-08-30T18:48:07Z

Authors

Liu, Ting

Publisher

University of Waterloo

Abstract

Information technologies have recently led to a surge of electronic documents in the form of emails, webpages, blogs, news articles, etc. To help users decide which documents may be interesting to read, it is common practice to organize documents by categories/topics. A wide range of supervised and unsupervised learning techniques already exist for automated text classification and text clustering. However, supervised learning requires a training set of documents already labeled with topics/categories, which is not always readily available. In contrast, unsupervised learning techniques do not require labeled documents, but assigning a suitable category to each resulting cluster remains a difficult problem. The state of the art consists of extracting keywords based on word frequency (or related heuristics). In this thesis, we improve the extraction of keywords for unsupervised labeling of document clusters by designing a Bayesian approach based on topic modeling. More precisely, we describe an approach that uses a large side corpus to infer a language model that implicitly encodes the semantic relatedness of different words. This language model is then used to build a generative model of the cluster in such a way that the probability of generating each word depends on its frequency in the cluster as well as the frequency of its semantically related words. The words with the highest probability of generation are then extracted to label the cluster. In this approach, the side corpus can be thought as a source of domain knowledge or context. However, there are two potential problems: processing a large side corpus can be time consuming and if the content of this corpus is not similar enough to the cluster, the resulting language model may be biased. We deal with those issues by designing a Bayesian transfer learning framework that allows us to process the side corpus just once offline and to weigh its importance based on the degree of similarity with the cluster.