Bayesian Unsupervised Labeling of Web Document Clusters

dc.contributor.authorLiu, Ting
dc.date.accessioned2011-08-30T18:48:07Z
dc.date.available2011-08-30T18:48:07Z
dc.date.issued2011-08-30T18:48:07Z
dc.date.submitted2011-08-22
dc.description.abstractInformation technologies have recently led to a surge of electronic documents in the form of emails, webpages, blogs, news articles, etc. To help users decide which documents may be interesting to read, it is common practice to organize documents by categories/topics. A wide range of supervised and unsupervised learning techniques already exist for automated text classification and text clustering. However, supervised learning requires a training set of documents already labeled with topics/categories, which is not always readily available. In contrast, unsupervised learning techniques do not require labeled documents, but assigning a suitable category to each resulting cluster remains a difficult problem. The state of the art consists of extracting keywords based on word frequency (or related heuristics). In this thesis, we improve the extraction of keywords for unsupervised labeling of document clusters by designing a Bayesian approach based on topic modeling. More precisely, we describe an approach that uses a large side corpus to infer a language model that implicitly encodes the semantic relatedness of different words. This language model is then used to build a generative model of the cluster in such a way that the probability of generating each word depends on its frequency in the cluster as well as the frequency of its semantically related words. The words with the highest probability of generation are then extracted to label the cluster. In this approach, the side corpus can be thought as a source of domain knowledge or context. However, there are two potential problems: processing a large side corpus can be time consuming and if the content of this corpus is not similar enough to the cluster, the resulting language model may be biased. We deal with those issues by designing a Bayesian transfer learning framework that allows us to process the side corpus just once offline and to weigh its importance based on the degree of similarity with the cluster.en
dc.identifier.urihttp://hdl.handle.net/10012/6180
dc.language.isoenen
dc.pendingfalseen
dc.publisherUniversity of Waterlooen
dc.subjectBayesianen
dc.subjectUnsuperviseden
dc.subjectLearningen
dc.subject.programComputer Scienceen
dc.titleBayesian Unsupervised Labeling of Web Document Clustersen
dc.typeMaster Thesisen
uws-etd.degreeMaster of Mathematicsen
uws-etd.degree.departmentSchool of Computer Scienceen
uws.peerReviewStatusUnrevieweden
uws.scholarLevelGraduateen
uws.typeOfResourceTexten

Files

Original bundle

Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
Liu_Ting.pdf
Size:
959.23 KB
Format:
Adobe Portable Document Format

License bundle

Now showing 1 - 1 of 1
No Thumbnail Available
Name:
license.txt
Size:
243 B
Format:
Item-specific license agreed upon to submission
Description: