Bayesian Unsupervised Labeling of Web Document Clusters

Liu, Ting

dc.contributor.author	Liu, Ting
dc.date.accessioned	2011-08-30T18:48:07Z
dc.date.available	2011-08-30T18:48:07Z
dc.date.issued	2011-08-30T18:48:07Z
dc.date.submitted	2011-08-22
dc.description.abstract	Information technologies have recently led to a surge of electronic documents in the form of emails, webpages, blogs, news articles, etc. To help users decide which documents may be interesting to read, it is common practice to organize documents by categories/topics. A wide range of supervised and unsupervised learning techniques already exists for automated text classification and text clustering. However, supervised learning requires a training set of documents already labeled with topics/categories, which is not always readily available. In contrast, unsupervised learning techniques do not require labeled documents, but assigning a suitable category to each resulting cluster remains a difficult problem. The state of the art consists of extracting keywords based on word frequency (or related heuristics). In this thesis, we improve the extraction of keywords for unsupervised labeling of document clusters by designing a Bayesian approach based on topic modeling. More precisely, we describe an approach that uses a large side corpus to infer a language model that implicitly encodes the semantic relatedness of different words. This language model is then used to build a generative model of the cluster in such a way that the probability of generating each word depends on its frequency in the cluster as well as the frequency of its semantically related words. The words with the highest probability of generation are then extracted to label the cluster. In this approach, the side corpus can be thought of as a source of domain knowledge or context. However, there are two potential problems: processing a large side corpus can be time-consuming, and if the content of this corpus is not similar enough to the cluster, the resulting language model may be biased.
We deal with those issues by designing a Bayesian transfer learning framework that allows us to process the side corpus just once offline and to weigh its importance based on the degree of similarity with the cluster.	en
dc.identifier.uri	http://hdl.handle.net/10012/6180
dc.language.iso	en	en
dc.pending	false	en
dc.publisher	University of Waterloo	en
dc.subject	Bayesian	en
dc.subject	Unsupervised	en
dc.subject	Learning	en
dc.subject.program	Computer Science	en
dc.title	Bayesian Unsupervised Labeling of Web Document Clusters	en
dc.type	Master Thesis	en
uws-etd.degree	Master of Mathematics	en
uws-etd.degree.department	School of Computer Science	en
uws.peerReviewStatus	Unreviewed	en
uws.scholarLevel	Graduate	en
uws.typeOfResource	Text	en
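
Illustrative sketch (not taken from the thesis): the abstract describes scoring each word by its own frequency in the cluster together with the frequency of semantically related words, where relatedness is learned from a side corpus. The Python below is a minimal stand-in for that idea, assuming simple document-level co-occurrence in place of the thesis's Bayesian topic model. The function names `relatedness` and `label_cluster` and the `side_weight` parameter are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def relatedness(side_docs):
    """Estimate P(v | w) from a side corpus by document co-occurrence.

    Assumption: co-occurrence is a crude substitute for the topic-model
    language model described in the abstract.
    """
    co = defaultdict(Counter)
    for doc in side_docs:
        words = set(doc)
        for w in words:
            for v in words:
                co[w][v] += 1
    rel = {}
    for w, row in co.items():
        total = sum(row.values())
        rel[w] = {v: c / total for v, c in row.items()}  # each row sums to 1
    return rel


def label_cluster(cluster_docs, rel, side_weight=0.5, k=3):
    """Return the k highest-scoring words as candidate cluster labels.

    score(w) = (1 - side_weight) * freq(w)
               + side_weight * sum_v freq(v) * P(w | v)
    side_weight is a hypothetical knob for how much the side corpus counts;
    side_weight=0 reduces to plain word-frequency keyword extraction.
    """
    counts = Counter(w for doc in cluster_docs for w in doc)
    total = sum(counts.values())
    freq = {w: c / total for w, c in counts.items()}
    scores = {}
    for w in freq:
        # Probability mass passed to w from related words seen in the cluster.
        related = sum(freq[v] * rel.get(v, {}).get(w, 0.0) for v in freq)
        scores[w] = (1 - side_weight) * freq[w] + side_weight * related
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


if __name__ == "__main__":
    side = [["car", "automobile", "vehicle"], ["car", "engine"]]
    cluster = [["car", "engine"], ["automobile"], ["zebra", "zebra"]]
    print(label_cluster(cluster, relatedness(side), side_weight=0.5, k=2))
```

In this sketch the side corpus is processed once by `relatedness`, and its influence on a given cluster is controlled afterwards by `side_weight`. That loosely mirrors the offline processing and similarity-based weighting mentioned in the abstract, but here the weight is set by hand rather than inferred.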

Files

Original bundle

Name: Liu_Ting.pdf
Size: 959.23 KB
Format: Adobe Portable Document Format

Collections

Theses
Computer Science